METHOD AND APPARATUS FOR
BALANCING THE PROCESS LOAD ON
NETWORK SERVERS ACCORDING TO
NETWORK AND SERVER BASED POLICIES

CROSS REFERENCE TO RELATED 
APPLICATIONS 

This application claims priority from the following U.S. 
Provisional Application, the disclosure of which, including 
all appendices and all attached documents, is incorporated 
by reference in its entirety for all purposes: 

U.S. Provisional patent application Ser. No. 60/032,484,
Rodney L. Joffe, et al., entitled, "A Distributed Computing
System and Method for Distributing User Requests to
Replicated Network Servers", filed Dec. 9, 1996.

COPYRIGHT NOTICE 

A portion of the disclosure of this patent document 
contains material which is subject to copyright protection. 
The copyright owner has no objection to the facsimile 
reproduction by anyone of the patent document or the patent 
disclosure as it appears in the Patent and Trademark Office 
patent file or records, but otherwise reserves all copyright 
rights whatsoever. 

BACKGROUND OF THE INVENTION 

The present invention relates generally to the field of 
distributed computer systems, and more specifically to com- 
puting systems and methods for assigning requests to one of 
a multiplicity of network servers based upon best criteria 
such as the speed of the underlying network infrastructure. 

The explosive growth of the World Wide Web needs little 
introduction. Not only are members of the technical com- 
munity finding an ever greater number of technical and 
informational resources available on the World Wide Web, 
but also the mainstream populace is finding favorite 
restaurants, car makes and churches sporting new websites. 
The popularity of the World Wide Web as a communications 
medium lies in the richness of its information content and 
ease of use. Information in this medium exists as objects in 
a widely distributed collection of internetworked servers, 
each object uniquely addressable by its own uniform 
resource locator (URL). Since its inception, the World Wide 
Web has achieved a global prominence in everyday life and 
commerce. 

Yet this explosive growth has not been had without 
difficulty. The proliferation of commercial applications 
brings with it an ever increasing number of users making 
ever increasing numbers of inquiries. The problems of
latency and bandwidth constraints manifest themselves in
delay, lost information and distraught customers.

Network architects respond using an array of solutions.
Many responses fall within the category of solutions based
upon supplying more computing power. This may encompass
such alternatives as different web server software, web
server hardware or platform, increases in RAM or CPU in
the server, application rewrites, or increasing network
bandwidth by upgrading hardware. Another class of solutions
involves using multiple servers or locating servers
strategically. One method in this class is to locate the server
at the internet service provider. By selecting a service provider
with an optimal pairing capability, colocating the server at the
service provider's site can yield a much better connection to
the rest of the internet. Another approach is the use of
distributed servers: placing identical content servers at
strategic locations around the world, for example, one in New
York, one in San Francisco, and one in London. This
distributes the load to multiple servers and keeps traffic
closer to the requester. Another approach is to cluster
servers. Clustering enables sharing of hard drive arrays
across multiple servers. Another approach is the server farm.
This entails the use of multiple webservers with identical
content, or segmentation based upon functionality, for
example, two servers for web functions, two for FTP, two as
a database and so forth. A variation on the server farm is the
distributed server farm. This places server farms at strategic
locations, essentially combining the server farm with the
distributed server approach.

The multiple and distributed server approaches solve one
problem at the expense of creating another. If there are
multiple servers, how does the end user locate your site?
Presently, names and uniform resource locators (URLs) are
resolved into unique single addresses by a domain name
service (DNS). DNS servers maintain a list of domain names
cross referenced to individual IP addresses. However, if
multiple web servers or server farms are used, the DNS
system must be modified. A common approach to this
problem is to modify the DNS system to a one to many
mapping of names to IP addresses. Thus the DNS will return
a list of IP addresses for any particular web object. These
may then be handed out to the various clients in a
round-robin fashion. There are, however, several drawbacks to
this approach. The round-robin paradigm returns IP addresses in
a strict order with little regard to the location of the requester
or the server. The scheme has no knowledge of server
architecture or loading. The selection simply progresses
down a simple list. One server may receive all heavy duty
users. Additionally the weakest link determines the overall
performance, so server platforms need to be kept in
relative parity. Another problem is that the DNS simply
returns IP addresses with no regard as to whether the server
to which the address corresponds is operational.
Consequently, if one of the round-robin servers happens to
be off-line for maintenance, DNS continues to give out the
address, and potential users continue to receive time-out
error responses. Thus, the round-robin modification of DNS
makes a broad attempt to solve the distributed server
problem. However, there is no regard to network traffic, load
balancing among servers or reliability issues.
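
The drawback described above can be illustrated with a minimal sketch of round-robin address selection. The addresses, the host name and the health table below are hypothetical values chosen only for illustration; the resolver is a stand-in for the behavior described, not an actual DNS implementation.

```python
from itertools import cycle

# Hypothetical replicated-server addresses (documentation address range).
SERVERS = ["192.0.2.10", "192.0.2.20", "192.0.2.30"]

# Hypothetical health state: the second server is off-line for maintenance.
IS_UP = {"192.0.2.10": True, "192.0.2.20": False, "192.0.2.30": True}


def round_robin_resolver(addresses):
    """Return a resolver that hands out addresses in strict list order."""
    rotation = cycle(addresses)

    def resolve(_name):
        # The requester's location, server load and server health
        # play no part in the choice.
        return next(rotation)

    return resolve


resolve = round_robin_resolver(SERVERS)
answers = [resolve("www.example.com") for _ in range(6)]
dead_answers = [a for a in answers if not IS_UP[a]]
print(answers)
print(f"{len(dead_answers)} of {len(answers)} answers point at an off-line server")
```

In this sketch one answer in three directs a client to the off-line server, regardless of how many working servers remain.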

Several products on the market purport to address these
problems, but these prior efforts all suffer the drawback that
they require that the user software environment be modified
in order to facilitate replicated server selection. A scheme
that requires user software modifications is less desirable
due to the practical problems of ensuring widespread
software distribution. Such schemes are, at best, useful as
optimization techniques.

One such class of approaches comprises those that rely on
explicit user selection to assign a user request to a server.
The user application may include additional steps that
require the user to have sufficient knowledge, sophistication,
and patience to make their own server selection. Such
schemes are not always desirable for a number of reasons.

A technique based on selective host routing uses multiple
replicated servers, all with the same network address,
located at different points in the network topology. A router
associated with each server captures incoming network
traffic to the shared server address, and forwards the traffic
to the specific server. This technique can only statically
distribute the client request load to the nearby server, with no
consideration of server load or other network characteristics.

The BIND implementation of the Domain Name System
server may include techniques to bind server names to
different network addresses, where one of a set of different
multiple addresses is assigned either sequentially or
randomly. Service providers assign a different address to each
replicated server, and BIND directs user requests to the
alternative servers. This technique can only statically
distribute the client request load to an arbitrary server, with no
consideration of server load or other network characteristics.

SONAR is an emerging IETF (Internet Engineering Task
Force) protocol for distributing network characteristics, in
particular, for topological closeness. SONAR includes a data
format for representing query requests and responses, but
does not specify a mechanism for determining the network
characteristics.

The Cisco Local Director is a product that works as a
network traffic multiplexor that sits in front of multiple local
servers and distributes new transport connections to each
server based on the amount of traffic flowing to the servers.
This product does not consider network characteristics in its
decision, and further requires that the replicated servers be
collocated. Cisco Systems is a company headquartered in
San Jose, Calif.

The Cisco Distributed Director redirects user requests to
topologically distant servers based on information obtained
from network routing protocols. The Distributed Director
intercepts either incoming DNS requests or HTTP requests,
and provides the appropriate response for redirection. This
product does not consider server load, and only considers the
restricted set of information available from routing
protocols; this information is also limited in accuracy by the
aggregation techniques required to enable scalable internet
routing.

Although these products, taken together, consider server
load and network characteristics, they do not make an
integrated server selection. Yet, for all these efforts, the
pundits' criticism still rings true: it is still the "world wide
wait." For this reason, what is needed is a system which
automatically selects an appropriate server from which to
retrieve a data object for a user based upon the user's
request, and the capabilities and topology of the underlying
network.

SUMMARY OF THE INVENTION

The present invention provides the ability to assign
requests for data objects made by clients among multiple
network servers. The invention provides a distributed
computing system and methods to assign user requests to
replicated servers contained by the distributed computing
system in a manner that attempts to meet the goals of a
particular routing policy. Policies may include minimizing
the amount of time for the request to be completed. For
example, a system according to the invention may be
configured to serve data objects to users according to the
shortest available network path.

Specifically, the invention provides a system for routing
requests for data objects from any number of clients based
upon a "best server" routing policy to one of multiple
content servers. Content servers serve data objects
responsive to clients' requests via one or more network access
points, in accordance with the decision of a director. The
director determines based upon the routing policy the
routing of said requests for data objects to a particular content
server.

In accordance with a particular aspect of the invention,
routing policies may comprise any of the following, a
combination of any of the following, or none of the
following: 1) the least number of open TCP connections; 2) the
most available free RAM; 3) the most available free SWAP
(virtual memory); 4) the highest amount of CPU idle time;
or 5) the fastest ICMP route to the client's machine.

Advantages of the approaches according to the invention
are increased tolerance of faults occurring in the underlying
hardware and reliability over prior art web servers. The
invention will be better understood upon reference to the
following detailed description and its accompanying
drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1A depicts a representative client server relationship
in accordance with a particular embodiment of the
invention;

FIG. 1B depicts a functional perspective of the
representative client server relationship in accordance with a
particular embodiment of the invention;

FIG. 1C depicts a representative internetworking
environment in accordance with a particular embodiment of the
invention;

FIG. 1D depicts a relationship diagram of the layers of the
TCP/IP protocol suite;

FIG. 2A depicts a distributed computing environment in
accordance with a particular embodiment of the invention;

FIG. 2B depicts a distributed computing environment in
accordance with an alternative embodiment of the invention;

FIG. 3A depicts the relationship of processes in
accordance with a representative embodiment of the invention;

FIG. 3B depicts the relationship of processes in
accordance with an alternative embodiment of the invention;

FIG. 3C depicts the relationship of processes in
accordance with a preferable embodiment of the invention;

FIG. 4A depicts process steps in accordance with a
particular embodiment of the invention;

FIG. 4B depicts process steps in accordance with an
alternative embodiment of the invention;

FIG. 4C depicts process steps in accordance with a
preferable embodiment of the invention; and

FIGS. 5A-5C depict flow charts of the optimization
process within a director component according to a
particular embodiment of the invention.

DESCRIPTION OF THE SPECIFIC
EMBODIMENTS

1.0 Introduction

A preferable embodiment of a server load balancing
system according to the invention has been reduced to
practice and will be made available under the trade name
"HOPSCOTCH™".

A word about nomenclature is in order. Systems
according to the present invention comprise a multiplicity of
processes which may exist in alternative embodiments on
any of a multiplicity of computers in a distributed networked
environment or as concurrent processes running in virtual
machines or address spaces on the same computer. To limit
the exponential growth of names, the following conventions
have been employed to enhance readability. Individual
deviations will be noted where they occur. An 'xyz server'
is a computer or virtual machine housing a collection of
processes which make up xyz. An 'xyz component' is a
collection of processes which perform a set of functions
collectively referred to as xyz. An 'xyz' is the set of
functions being performed by the xyz component on the xyz
machine.
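
As an illustration of the routing policies enumerated in the Summary (fewest open TCP connections, most free RAM, most free SWAP, most CPU idle time, fastest ICMP route), the following is a minimal sketch of a "best server" selection. The record fields, policy names and sample figures are hypothetical, and combining several policies in priority order (later policies only break ties) is an assumption of this sketch rather than a rule stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerStats:
    """Hypothetical per-server measurements; field names are illustrative."""
    name: str
    open_tcp_connections: int
    free_ram_mb: int
    free_swap_mb: int
    cpu_idle_pct: float
    icmp_rtt_ms: float


# Each policy maps a server to a score where lower is better, so the
# "most" and "highest" criteria are negated.
POLICIES = {
    "least_connections": lambda s: s.open_tcp_connections,
    "most_free_ram": lambda s: -s.free_ram_mb,
    "most_free_swap": lambda s: -s.free_swap_mb,
    "most_cpu_idle": lambda s: -s.cpu_idle_pct,
    "fastest_icmp_route": lambda s: s.icmp_rtt_ms,
}


def best_server(servers, policy_names):
    """Pick the server that ranks best under the named policies, in order.

    Assumption of this sketch: policies are combined lexicographically,
    so a later policy is consulted only when earlier ones tie.
    """
    def score(server):
        return tuple(POLICIES[name](server) for name in policy_names)

    return min(servers, key=score)


servers = [
    ServerStats("A", 12, 512, 1024, 40.0, 35.0),
    ServerStats("B", 7, 256, 2048, 75.0, 80.0),
    ServerStats("C", 7, 768, 512, 20.0, 20.0),
]

print(best_server(servers, ["least_connections", "most_free_ram"]).name)
print(best_server(servers, ["fastest_icmp_route"]).name)
```

Other ways of combining policies, such as a weighted sum of normalized measurements, would fit the same interface; the text leaves the combination rule open.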
1.1 Hardware Overview

The distributed computing system for server load
balancing (the "system") of the present invention is implemented
in the Perl programming language and is operational on a
computer system such as shown in FIG. 1A. This invention
may be implemented in a client-server environment, but a
client-server environment is not essential. FIG. 1A shows a
conventional client-server computer system which includes
a server 20 and numerous clients, one of which is shown as
client 25. The term "server" is used in the context
of the invention, where the server receives queries from
(typically remote) clients, does substantially all the
processing necessary to formulate responses to the queries, and
provides these responses to the clients. However, server 20
may itself act in the capacity of a client when it accesses
remote databases located at another node acting as a
database server.

The hardware configurations are in general standard and
will be described only briefly. In accordance with known
practice, server 20 includes one or more processors 30 which
communicate with a number of peripheral devices via a bus
subsystem 32. These peripheral devices typically include a
storage subsystem 35, comprised of memory subsystem 35a
and file storage subsystem 35b, which hold computer
programs (e.g., code or instructions) and data, a set of user
interface input and output devices 37, and an interface to
outside networks, which may employ Ethernet, Token Ring,
ATM, IEEE 802.3, ITU X.25, Serial Link Internet Protocol
(SLIP) or the public switched telephone network. This
interface is shown schematically as a "Network Interface"
block 40. It is coupled to corresponding interface devices in
client computers via a network connection 45.

Client 25 has the same general configuration, although
typically with less storage and processing capability. Thus,
while the client computer could be a terminal or a low-end
personal computer, the server computer is generally a
high-end workstation or mainframe, such as a SUN SPARC™
server. Corresponding elements and subsystems in the client
computer are shown with corresponding, but primed,
reference numerals.

The user interface input devices typically include a
keyboard and may further include a pointing device and a
scanner. The pointing device may be an indirect pointing
device such as a mouse, trackball, touchpad, or graphics
tablet, or a direct pointing device such as a touchscreen
incorporated into the display. Other types of user interface
input devices, such as voice recognition systems, are also
possible.

The user interface output devices typically include a
printer and a display subsystem, which includes a display
controller and a display device coupled to the controller. The
display device may be a cathode ray tube (CRT), a flat-panel
device such as a liquid crystal display (LCD), or a projection
device. The display controller provides control signals to the
display device and normally includes a display memory for
storing the pixels that appear on the display device. The
display subsystem may also provide non-visual display such
as audio output.

The memory subsystem typically includes a number of
memories including a main random access memory (RAM)
for storage of instructions and data during program
execution and a read only memory (ROM) in which fixed
instructions are stored. In the case of Macintosh-compatible
personal computers the ROM would include portions of the
operating system; in the case of IBM-compatible personal
computers, this would include the BIOS (basic input/output
system).

The file storage subsystem provides persistent
(non-volatile) storage for program and data files, and typically
includes at least one hard disk drive and at least one floppy
disk drive (with associated removable media). There may
also be other devices such as a CD-ROM drive and optical
drives (all with their associated removable media).
Additionally, the computer system may include drives of the
type with removable media cartridges. The removable media
cartridges may, for example, be hard disk cartridges, such as
those marketed by Syquest and others, and flexible disk
cartridges, such as those marketed by Iomega. One or more
of the drives may be located at a remote location, such
as in a server on a local area network or at a site of the Internet's
World Wide Web.

In this context, the term "bus subsystem" is used
generically so as to include any mechanism for letting the various
components and subsystems communicate with each other
as intended. With the exception of the input devices and the
display, the other components need not be at the same
physical location. Thus, for example, portions of the file
storage system could be connected via various local-area or
wide-area network media, including telephone lines.
Similarly, the input devices and display need not be at the
same location as the processor, although it is anticipated that
the present invention will most often be implemented in the
context of PCs and workstations.

Bus subsystem 32 is shown schematically as a single bus,
but a typical system has a number of buses such as a local
bus and one or more expansion buses (e.g., ADB, SCSI, ISA,
EISA, MCA, NuBus, or PCI), as well as serial and parallel
ports. Network connections are usually established through
a device such as a network adapter on one of these expansion
buses or a modem on a serial port. The client computer may
be a desktop system or a portable system.

The user interacts with the system using interface devices
37' (or devices 37 in a standalone system). For example,
client queries are entered via a keyboard, communicated to
client processor 30', and thence to network interface 40' over
bus subsystem 32'. The query is then communicated to
server 20 via network connection 45. Similarly, results of the
query are communicated from the server to the client via
network connection 45 for output on one of devices 37' (say
a display or a printer), or may be stored on storage
subsystem 35'.

FIG. 1B is a functional diagram of the computer system
of FIG. 1A. FIG. 1B depicts a server 20, and a representative
client 25 of a multiplicity of clients which may interact with
the server 20 via the internet 45 or any other
communications method. Blocks to the right of the server are indicative
of the processing components and functions which occur in
the server's program and data storage indicated by block 35a
in FIG. 1A. A TCP/IP "stack" 44 works in conjunction with
Operating System 42 to communicate with processes over a
network or serial connection attaching Server 20 to internet
45. Web server software 46 executes concurrently and
cooperatively with other processes in server 20 to make data
objects 50 and 51 available to requesting clients. A Common
Gateway Interface (CGI) script 55 enables information from
user clients to be acted upon by web server 46, or other
processes within server 20. Responses to client queries may
be returned to the clients in the form of Hypertext Markup
Language (HTML) document outputs which are then
communicated via internet 45 back to the user.

Client 25 in FIG. 1B possesses software implementing
functional processes operatively disposed in its program and
data storage as indicated by block 35a' in FIG. 1A. TCP/IP
stack 44' works in conjunction with Operating System 42' to
communicate with processes over a network or serial
connection attaching Client 25 to internet 45. Software
implementing the function of a web browser 46' executes
concurrently and cooperatively with other processes in client 25
to make requests of server 20 for data objects 50 and 51. The
user of the client may interact via the web browser 46' to
make such queries of the server 20 via internet 45 and to
view responses from the server 20 via internet 45 on the web
browser 46'.

1.2 Network Overview

FIG. 1C is illustrative of the internetworking of a plurality
of clients such as client 25 of FIGS. 1A and 1B and a
multiplicity of servers such as server 20 of FIGS. 1A and 1B
as described herein above. In FIG. 1C, a network 60 is an
example of a Token Ring or frame oriented network.
Network 60 links a host 61, such as an IBM RS6000 RISC
workstation, which may be running the AIX operating
system, to a host 62, which is a personal computer, which
may be running Windows 95, IBM OS/2 or a DOS operating
system, and a host 63, which may be an IBM AS/400
computer, which may be running the OS/400 operating
system. Network 60 is internetworked to a network 70 via a
system gateway which is depicted here as router 75, but
which may also be a gateway having a firewall or a network
bridge. Network 70 is an example of an Ethernet network
that interconnects a host 71, which is a SPARC workstation,
which may be running the SunOS operating system, with a host
72, which may be a Digital Equipment VAX6000 computer
which may be running the VMS operating system.

Router 75 is a network access point (NAP) of network 70
and network 60. Router 75 employs a Token Ring adapter
and an Ethernet adapter. This enables router 75 to interface with
the two heterogeneous networks. Router 75 is also aware of
the Internetwork Protocols, such as ICMP, ARP and RIP,
which are described below.

FIG. 1D is illustrative of the constituents of the
Transmission Control Protocol/Internet Protocol (TCP/IP)
protocol suite. The base layer of the TCP/IP protocol suite is the
physical layer 80, which defines the mechanical, electrical,
functional and procedural standards for the physical
transmission of data over communications media such as, for
example, the network connection 45 of FIG. 1A. The
physical layer may comprise electrical, mechanical or
functional standards such as whether a network is
packet-switching or frame-switching; or whether a network is based on a
Carrier Sense Multiple Access/Collision Detection
(CSMA/CD) or a frame relay paradigm.

Overlying the physical layer is the data link layer 82. The
data link layer provides the function and protocols to
transfer data between network resources and to detect errors that
may occur at the physical layer. Operating modes at the
datalink layer comprise such standardized network
topologies as IEEE 802.3 Ethernet, IEEE 802.5 Token Ring, ITU
X.25, or serial (SLIP) protocols.

Network layer protocols 84 overlay the datalink layer and
provide the means for establishing connections between
networks. The standards of network layer protocols provide
operational control procedures for internetworking
communications and routing information through multiple
heterogeneous networks. Examples of network layer protocols are
the Internet Protocol (IP) and the Internet Control Message
Protocol (ICMP). The Address Resolution Protocol (ARP) is
used to correlate an Internet address and a Media Access
Control (MAC) address of a particular host. The Routing
Information Protocol (RIP) is a dynamic routing protocol for passing
routing information between hosts on networks. The Internet
Control Message Protocol (ICMP) is an internal protocol for
passing control messages between hosts on various
networks. ICMP messages provide feedback about events in the
network environment or can help determine if a path exists
to a particular host in the network environment. The latter is
called a "Ping". The Internet Protocol (IP) provides the basic
mechanism for routing packets of information in the
Internet. IP is a non-reliable communication protocol. It provides
a "best efforts" delivery service and does not commit
network resources to a particular transaction, nor does it
perform retransmissions or give acknowledgments.

The transport layer protocols 86 provide end-to-end
transport services across multiple heterogeneous networks. The
User Datagram Protocol (UDP) provides a connectionless,
datagram oriented service which provides a non-reliable
delivery mechanism for streams of information. The
Transmission Control Protocol (TCP) provides a reliable
session-based service for delivery of sequenced packets of
information across the Internet. TCP provides a connection oriented
reliable mechanism for information delivery.

The session, or application layer 88 provides a list of
network applications and utilities, a few of which are
illustrated here. For example, File Transfer Protocol (FTP) is
a standard TCP/IP protocol for transferring files from one
machine to another. FTP clients establish sessions through
TCP connections with FTP servers in order to obtain files.
Telnet is a standard TCP/IP protocol for remote terminal
connection. A Telnet client acts as a terminal emulator and
establishes a connection using TCP as the transport
mechanism with a Telnet server. The Simple Network Management
Protocol (SNMP) is a standard for managing TCP/IP
networks. SNMP tasks, called "agents", monitor network status
parameters and transmit these status parameters to SNMP
tasks called "managers." Managers track the status of
associated networks. A Remote Procedure Call (RPC) is a
programming interface which enables programs to invoke
remote functions on server machines. The Hypertext
Transfer Protocol (HTTP) facilitates the transfer of data objects
across networks via a system of uniform resource identifiers
(URI).

The Hypertext Transfer Protocol is a simple protocol built
on top of Transmission Control Protocol (TCP). HTTP
provides a method for users to obtain data objects from
various hosts acting as servers on the Internet. User requests
for data objects are made by means of an HTTP
request. A GET request as depicted comprises 1) an
HTTP header of the format "http://"; followed by 2) an
identifier of the server on which the data object resides;
followed by 3) the full path of the data object; followed by
4) the name of the data object. In the GET request shown
below, a request is being made of the server "www.w3.org"
for the data object with a path name of "/pub/" and a name
of "MyData.html":

GET http://www.w3.org/pub/MyData.html (1)

Processing of a GET request entails the establishing of a
TCP/IP connection with the server named in the GET
request and receipt from the server of the data object
specified. After receiving and interpreting a request
message, a server responds in the form of an HTTP
RESPONSE message.

Response messages begin with a status line comprising a
protocol version followed by a numeric Status Code and an
associated textual Reason Phrase. These elements are
separated by space characters. The format of a status line is
depicted in line (2):

Status-Line = HTTP-Version Status-Code Reason-Phrase (2)

The status line always begins with a protocol version and
status code, e.g., "HTTP/1.0 200". The status code element
is a three digit integer result code of the attempt to
understand and satisfy a prior request message. The reason phrase
is intended to give a short textual description of the status
code. The first digit of the status code defines the class of
response. There are five categories for the first digit. 1XX is
an information response. It is not currently used. 2XX is a
successful response: the action was successfully received,
understood and accepted. 3XX is a redirection response
indicating that further action must be taken in order to
complete the request. It is this response that is used by
certain embodiments of the present invention to cause a
client to redirect to a selected server site. 4XX is a client
error response. This indicates a bad syntax in the request.
Finally, 5XX is a server error. This indicates that the server
failed to fulfill an apparently valid request.

Particular formats of HTTP messages are described in
Crocker, D., "Standard for the Format of ARPA Internet Text
Messages", STD 11, RFC 822, UDEL, August 1982, which
is incorporated by reference herein for all purposes.

2.0 Specific Configurations

FIG. 2A depicts a representative distributed computing
system according to the present invention. In FIG. 2A, a
network 200 interconnects a plurality of server machines
with one another and to an external internetworking
environment via a plurality of Network Access Points (NAPs).
The topology of this internal network is completely arbitrary
with respect to the invention. It may be Ethernet, Token
Ring, Asynchronous Transfer Mode (ATM) or any other
convenient network topology. The Network Access Points
are points of connection between large networks, and may
comprise routers, gateways, bridges or other methods of
joining networks in accordance with particular network
topologies.

Network access points 202, 204, 206 and 208 provide
network reachability to external networks A, B, C and D for
communication with client machines via a plurality of
external network paths. In a particular embodiment, these
Network Access Points house a portion of the routing
configuration component. In a preferable embodiment, these
NAPs are routers 222, 224, 226 and 228 that peer with other

Director component, a Ping Manager component, and a
Load Manager component, each of which will be described
hereinbelow.

One or more Content Servers 232, 234, 236 and 238
perform the actual serving of content, i.e., Web pages or FTP
files. Each has associated with it one or more instances of a
Content Server component capable of serving data such that
outgoing packets have a source address that is selectable by
the Director component.

Each Content Server has a network alias IP address for
each router controlled by the system. For example, in a
system with four routers, a particular content server will
have four separate network alias IP addresses. Routing
within the system is configured for "policy routing", i.e.,
routing based on the source address of packets, so that
depending on which network alias IP address the packets are
from, they will be routed through a specific system router
chosen by the Director. A series of selectable IP tunnels
enable the Content Servers to send data to clients using a
"best route". IP tunnels are configured such that each server
can serve data out through a chosen network access point.
By contrast, systems commonly known in the art serve data
routed out of the network through a default peering point,
i.e., network exit points that can route to the other networks
which make up the Internet. Typically the default peering
point is the only peering point for the network. Each IP
tunnel is configured to send all data to a different peer router.
Every IP tunnel starts at one router, and ends at another
(remote) router. This enables a content server farm (one or
more content servers in the same physical location, serving
behind a router) to serve data through a different content
server farm's router if the Director has determined this to be
the optimal path.

Routers are statically configured to send packets down an
IP tunnel when the packets have a source address that is
associated with that IP tunnel. For example, in a network
having three peering points the router at a first peering point,
call it "A", would have two outgoing tunnels, "tunnel 1" and
"tunnel 2". These outgoing tunnels lead to the other two
routers, each residing at one of the other two peering points,
B and C. The router at peering point A would be configured
so that it routes all packets that have a source address of
1.1.1.1 down tunnel 1 and all packets that have a source
address of 2.2.2.2 down tunnel 2. Tunnel 1 sends all packets
to a second peering point, B, and tunnel 2 sends all packets
to a third peering point, C. This is source-based routing,

networks and are Open Shortest Path First (OSPF) routing commonly known as "Policy Routing", 

algorithm aware. OSPF is capable of accomodating the case A Content Server associated with the router at peering 

where several machines have exactly the same IP address, by point A runs server software that binds the local side of its 

routing packets to the closest machine. By contrast an 50 server sockets to addresses 1.1.1.1 and 2.2.2.2, enabling it to 

alternative routing mechanism, Routing Information Pro to- serve from either address. The Director software possesses 

col (RIP), would not handle this case. configuration information about this Content Server and its 

Further, these routers employ IP tunneling techniques, as server addresses. This enables the Director to determine that 

are well known to persons of ordinary skill in the art, and are when the Content Server serves from the 1.1.1.1 address, all 

policy routing capable, i.e., able to route packets based in 55 packets are served via tunnel 1 to network access point B, 

part on their source address. Each is configured to use IP and when the Content Server serves from the 2.2.2.2 

tunneling to route packets out of the network through a address, all packets are served via tunnel 2 to network access 

particular router at a particular NAP, based on packet source point C. The Director software is able to select the NAP 

addresses and server availability. through which each Content Server will serve its data by 

FIG. 2A also shows Front End Servers 212, 214, 216 and 60 informing the client which fully qualified domain name, 

218, that are co-located at each NAP. These Front End corresponding to an IP address, to access for service, since 

Servers house a Front End component process. In a prefer- that will be the source address of the request's reply packets, 

able embodiment, the Front End Servers also house an IP The reply packets are automatically routed through the NAP 

Relayer component process. Functions of these processes chosen by the Director software. 

are described hereinbelow. 65 This policy routing configuration is set up once for each 

A Director server 250 houses several software system during installation. The Director has access to a table 

components, which include in the preferable embodiment a of IP addresses for each content server, and the correspond- 
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ing system router for each of the IP addresses of a particular 
content server. This enables the Director to select a Content 
Server IP address that will route through the system router 
of the Director's choice. The Director must first decide 
which Content Server has the least load. It formulates this 
from the data given to it by the Load Manager (which, in 
turn, collected the data from each of the Load Daemons). 
Once the Director has chosen the least-loaded Content 
Server machine, it will choose an address from that 
machine's set of network alias IP addresses that routes 
through a router having the best ICMP one-way trip time to 
the browsing client. It makes this decision based on data 
given to it by the Ping Manager (which in turn collected its 
data from the Ping Daemons). 
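
The two-stage choice described above (least-loaded Content Server first, then the alias address whose router has the best time to the client) can be sketched in Python as follows. This is a minimal illustration only. The table ALIAS_ROUTES, the server names, the addresses 3.3.3.3 and 4.4.4.4, and the function choose_alias are hypothetical and do not appear in the disclosure; only the 1.1.1.1 and 2.2.2.2 addresses and the peering points B and C come from the example above. Lower load values are assumed to mean a less loaded server.

```python
# Hypothetical alias table: content server -> {alias IP -> system router
# that the alias is policy-routed through}. Names are illustrative only.
ALIAS_ROUTES = {
    "server_a": {"1.1.1.1": "B", "2.2.2.2": "C"},
    "server_d": {"3.3.3.3": "B", "4.4.4.4": "C"},
}


def choose_alias(loads, ping_ms):
    """Pick the least-loaded server, then the alias whose router is fastest.

    loads:   server name -> load value (assumed: lower is less loaded)
    ping_ms: router name -> measured time to the client in milliseconds
    """
    server = min(loads, key=loads.get)
    aliases = ALIAS_ROUTES[server]
    alias = min(aliases, key=lambda ip: ping_ms[aliases[ip]])
    return server, alias, aliases[alias]


loads = {"server_a": 12, "server_d": 30}
ping_ms = {"B": 140, "C": 95}
print(choose_alias(loads, ping_ms))
```

With these sample values the sketch selects server_a and its 2.2.2.2 alias, so replies would leave through the router at peering point C.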

FIG. 2B depicts an alternative embodiment of a distrib- 
uted computing system according to the present invention. 
The embodiment of FIG. 2B differs from the embodiment of 
FIG. 2A, primarily in that the embodiment of FIG. 2B does 
not have a separate Director server. Rather, in the alternative 
embodiment of FIG. 2B, the processes which resided on the 
Director server 250 in FIG. 2A are distributed among the
front end servers 212, 214, 216 and 218, in FIG. 2B. 

3.0 Specific Processes 

FIG. 3A depicts the process components of a representa- 
tive embodiment according to the present invention. 

3.1 Process Components 

The Front End — An embodiment of a front-end compo- 
nent 360 receives client requests for data objects. This may 
be an incoming HTTP request in the preferable embodiment. 
The front end next solicits "advice" from the director 
components 362, if available, by immediately sending the 
browsing client's IP address to the Director, and waiting for 
the Director to select the "best" server. Finally, it sends a 
reply to the requesting client that directs the client to contact 
a specific server for further request processing. The front- 
end must understand the protocol of the client request, and 
will use application-specific mechanisms to direct the client 
to the specific server. In a particular embodiment, the front 
end sends the browser client an HTTP redirection response 
to the best server's URL. An embodiment of this invention 
may comprise multiple front-end components that under- 
stand one or more user-level protocols. 
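
As a rough illustration of the HTTP redirection response mentioned above, the following Python sketch builds a minimal 302 response that points the browser at the selected server. The function name redirect_response, the host name and the path are assumptions made for the example; the disclosure does not specify the exact headers the front end emits.

```python
def redirect_response(best_server_host, path):
    """Build a minimal HTTP/1.0 302 response redirecting to the chosen server."""
    location = "http://%s%s" % (best_server_host, path)
    lines = [
        "HTTP/1.0 302 Moved Temporarily",
        "Location: " + location,
        "Content-Length: 0",
        "",  # blank line terminates the header section
        "",
    ]
    return "\r\n".join(lines)


response = redirect_response("www.la1-dc.test.com", "/index.html")
print(response)
```

The host name passed in would be the fully qualified domain name chosen by the Director, so that the client's follow-up request reaches the alias address that is policy-routed through the desired NAP.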

The Director — An embodiment of a director component 
362 receives data queries from front-end components 360 
and, using data about the source of the client, preferably, the 
browsing client's IP address, as well as replicated server 
status and network path characteristics, preferably ICMP 
echo response times, received from the collector 
components, such as the Ping Manager 364 and the Load 
Manager 366, returns information that enables front-ends to 
direct user requests. The decision takes all data into account 
and sends the front end the IP address of the "best" server. 
Director components include the decision methods to evalu- 
ate alternative server selections, and provide coordination 
for the entire system. An embodiment of this invention 
comprises one or more director components. 

The Collector Components — An embodiment of a collec- 
tor component monitors one or more characteristics of 
network paths or content server load, and makes this information
available to the director component for use in the
server selection decision. An embodiment of this invention 
comprises one or more collector components, preferably, a 
Ping Manager Component 364, a Ping Daemon Component 
(not shown), a Load Manager Component 366 and a Load 
Daemon Component (not shown). 

The Ping Manager tells the Director which content server
has the fastest ICMP echo path. This data is collected by the
Ping Manager 364. The Ping Manager receives its ping-time
data from individual server machines which are each using
ICMP pings to determine the ICMP routing time between
themselves and the browsing client's machine. The Ping
Manager then stores this information and reports it to the
Director, which uses it to make decisions about the best
route.

The Ping Daemon executes on server machines associated
with each content server machine cluster. A content server
machine (or cluster of them) resides near each of the
Network Access Points. The Ping Daemon waits for a ping
request (and its corresponding IP address, which is the
browsing client's IP address) and then pings the browsing
client's IP address to record the ICMP routing time through
its own closest border router. It then sends this data back to
the Ping Manager.

The Load Manager software is similar to the Ping
Manager, but stores and reports information from the Load
Daemon about each of the content server machines' current
load. It forwards this data to the Director as well.

The Load Daemon runs in conjunction with each content
server 368 and reports back to the Load Manager periodically.
It sends data about the number of currently open TCP
connections, free RAM, free SWAP, and CPU idle time.

FIG. 3B depicts the software components of an alternative
embodiment according to the present invention. Comparing
the software component diagram of the alternative embodiment
of FIG. 3B with that of the embodiment in FIG. 3A, the main
difference between the two embodiments is that in the
alternative embodiment depicted by FIG. 3B, the Director
process 362 is distributed among various servers. Thus, in
FIG. 3B the Director process 363 is shown as three separate
instances of a director process, whereas in FIG. 3A the
Director process 362 is shown as a singular Director process
with interfaces to other processes on multiple servers.

FIG. 3C depicts the software components of a preferable
embodiment according to the present invention. Comparing
the software component diagram of the embodiment of FIG.
3B with that of the embodiment in FIG. 3C, the main
difference between the two embodiments is that in the
embodiment depicted by FIG. 3C, the Front End process 360
includes an IP Relayer function.

The Front End/IP Relayer — An embodiment of a front- 
end/IP Relayer component 360 receives incoming IP packets
for any IP traffic. As in the embodiments of FIGS. 3A and
3B, the front end next solicits "advice" from the director
components 362, if available, by immediately sending the
client's IP address to the Director, and waiting for the
Director to select the "best" server. However, rather than
directing the client to contact a specific server for further
request processing, as is performed by the Front End
components in the embodiments of FIGS. 3A and 3B, the IP
Relayer forwards the packets to the chosen "best server" in
accordance with the determination made by the Director. An
embodiment of this invention may comprise multiple front-
end components functioning at the IP layer.

3.2 Steps to Service a Request

FIG. 4A depicts a set of steps that occur in the process of
receiving, evaluating, and answering a client's request in a
particular embodiment of the invention. In a step 402, a load
daemon process resident on a content server, such as 232 of
FIG. 2A, periodically sends updated load information directly
to a load manager process 366 residing on a Director server 250.
Subsequently, in a step 404, the Load Manager process updates
load information gathered from all content servers having
load daemon processes. This enables the Director process
362 to choose the least loaded content server machine
responsive to an incoming request such as in steps 410 and
412.

In a step 410, an incoming client request, which may be an
HTTP request from, for example, a Web browser, or an FTP
request from an FTP client, is routed via an arbitrary external
path to a system border gateway, typically a router at a
Network Access Point. At this point the client's request
becomes an input to the server load balancing distributed
computing system of the present invention. It is forwarded
to a front-end component 360 by IP acting in accordance
with the routing table associated with a router located at this
NAP. In step 412, the front-end component, responsive to
the arrival of the client application's request, makes a
request of the Director component 362 for a preferred server.
The Director, having received a front-end request for
information about the client request in step 412, requests
information, such as the most expedient path between servers
and clients, from a collector component, such as a Ping
Manager 364, in a step 413. In the embodiment of FIG. 4A,
the most expedient path is the path with the fastest ICMP
echo reply (Ping), determined by a Ping Manager Collector
Component 364 acting in conjunction with one or more Ping
Daemon Components co-located at the various NAPs in the
system network in a step 414. These Ping Daemons determine
the fastest ICMP echo path between their particular
front end server and the client by transmitting successive
Pings to the client via their particular front end server's
associated NAP, and timing the response, as depicted in a
step 415. Each Ping Daemon from time to time transmits the
time to a client via its particular NAP to the Ping Manager
in a step 416. The Ping Manager Component returns round
trip values for paths associated with the system's NAPs to
the client in a step 417. In specific embodiments, the Ping
Manager Component may initiate pro-active status queries
(not depicted), or may return status information at any time,
without explicit requests from the Director component. The
Director component indicates the correct content server to
be used for the client to the front end component in a step
418. The front-end component directs the client to the
correct content server using an application layer protocol,
preferably an HTTP redirect response, in a step 419. The
front-end response is forwarded via a NAP to the client
machine by the internetwork using, for example, the IP
protocol. Subsequently, as shown in step 420, requests from
the client will be made to the best content server, e.g., 232,
via the best route to his machine. (Which is also the best
place to enter the external network to get to the browsing
client's own network provider.)

FIG. 4B depicts a set of steps that occur in the process of
receiving, evaluating, and answering a client's request in an
alternative embodiment of the invention. Comparing the
processing steps of the embodiment depicted in FIG. 4B
with those of the embodiment depicted in FIG. 4A, it is clear
that the particular steps are identical. However, it is
noteworthy that steps 412, 414, 416 and 418 are not network
transactions, as they were in FIG. 4A, but rather transactions
between co-resident processes within one server machine.

FIG. 4C depicts a set of steps that occur in the process of
receiving, evaluating, and answering a client's request in a
preferable embodiment of the invention. Comparing the
processing steps of the embodiment depicted in FIG. 4B
with those of the embodiment depicted in FIG. 4C, it is clear
that the main difference is that steps 419 and 420 of FIG. 4B,
the redirect response and the subsequent HTTP conversation
steps, respectively, have been replaced by a new step 422. In
step 422, the IP traffic from the client is relayed to the "best
server" in accordance with the determination of the Director.
This takes the place of the redirect response step 419 being
made by the Front End process to the client as is shown in
FIG. 4B. It is noteworthy that steps 402, 404, 410, 412, 413,
414, 415, 416, 417 and 418 remain the same in the preferable
embodiment of FIG. 4C as in the embodiments of FIGS. 4A
and 4B.

FIG. 4C depicts the processing steps in the preferable
embodiment. In a step 410, a Front-End/Relayer 360
intercepts packets entering the network. Based upon address
information contained within the IP header, the Front-End/
Relayer determines the original destination server of the
packet. Next, in a step 412, the Front-End/Relayer calls upon
the Director 362 to make the "best server" routing decision.
In steps 413, 414, 415, 416, 417 and 418, the Director, in
conjunction with the Load Manager 366, Ping Manager 364,
Load Daemon and Ping Daemon components, determines a
"best server" for the IP traffic and communicates this
machine's address to the Front-End/Relayer. The decision
process is identical to that of the embodiments of FIGS. 4A
and 4B. In a step 422, the Front-End/Relayer relays all
packets from that client to the "best" server machine. Since
the packet relaying takes place at the IP level, packets for
any service running over the IP layer can be routed by the
methods of the present invention.

4.0 Decision-Making Methodologies 

4.1 Director 

The Director makes decisions about which Content Server
and which router is best for each request by a Client. FIG.
5A depicts a flowchart 500 of the process steps undertaken
by a Director in a specific embodiment of the invention. In
a step 501, a network metric is calculated for each
combination of "content server" and "outgoing router". A sorted
list is constructed of these metrics, having the format:

(site1 border1 metric1, . . . , siteN borderN metricN) (3)

In a step 502, the candidate servers for the best sites are
selected from the list produced in step 501, above. Processing
traverses the list, selecting the top ranking metric, as well
as any candidates that are within a certain percentage (say
X%) of the top ranking metric. Note that each content server
will be listed with all combinations of outgoing routers in the
list. It is not necessary to consider server-router
combinations which are nonsensical; for example, the
LA server going through the NY border need not be
considered if there is a combination of the LA server going
through the LA border, i.e., only consider the "best." For
example, with X=5, and the sorted list of combinations (site,
border, metric) as is depicted in lines (4):

@network=(ny ny 300 sj la 305 sj sj 312 ny sj 380 dc ny 400 ... ) (4)

The first three entries are:
ny ny 300 
sj la 305 
sj sj 312 

Since these are all within 5% of each other, the system will 
consider all three. However, note that San Jose is listed 
twice, and that the time of the second listing is longer than 
the time of the first. The first appearance of San Jose server 
therefore, is preferred over the second, thus the second is 
discarded in favor of the first. A server in either NY or SJ is 
selected. If a server in NY is chosen, then the border in NY 
will also be selected. If a server in SJ is chosen, the border 
in LA is "best." 
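
The candidate selection of step 502 can be sketched in Python using the sample values of lines (4). This is an illustrative sketch only: the function name select_candidates and the representation of the sorted list as (site, border, metric) tuples are assumptions, not part of the disclosure.

```python
def select_candidates(sorted_triples, percent):
    """Keep entries within `percent` of the best metric.

    The list is assumed to be sorted by ascending metric, so the first
    border seen for a site is that site's best border; later entries for
    the same site are discarded.
    """
    best = sorted_triples[0][2]
    limit = best * (1 + percent / 100.0)
    site_border = {}
    for site, border, metric in sorted_triples:
        if metric > limit:
            break  # everything after this point is worse still
        site_border.setdefault(site, border)
    return site_border


network = [
    ("ny", "ny", 300),
    ("sj", "la", 305),
    ("sj", "sj", 312),
    ("ny", "sj", 380),
    ("dc", "ny", 400),
]
print(select_candidates(network, 5))
```

With X=5 the cutoff is 315, so the first three entries qualify, and the duplicate San Jose entry is dropped in favor of the LA border, matching the outcome described above.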

In a step 503, the "best" server is selected from the
candidates. Make a list of all of the metrics for candidate
servers, and apply a statistical algorithm to select the best
server from the candidates. The result of this step is the site
for the selected server, and an identifier for the specific
server. (There may be multiple servers at each site).

US 6,185,619 B1

In a step 504, from the site of the server selected above,
the system recalls which border is to be used for that site
saved in step 502. Then, from the server identifier and the
outgoing border, the correct fully qualified domain name
that has the appropriate IP address for the internal policy
routing is determined.

4.2 Ping Manager

The Ping Manager constantly requests ping information
from one or more Ping Daemons. The Ping Manager sends
to the Director a sequence of values representative of the
round trip time to-and-from every client site. Non-
responsive client sites are represented by an arbitrary large
value if a Ping Daemon does not respond. In a particular
embodiment a Ping Manager will send to a Director a
sequence such as depicted by lines (5) and (6) below:

"client_address metric_site_1 metric_site_2 . . . metric_site_N \n" (5)

"128.9.19144 300 9999999 280 450 \n" (6)

In a preferred embodiment the site metrics are ordered.

@incoming_metric_order=('ny', 'la', 'dc', 'sj') (7)

FIG. 5B depicts a flowchart 510 of the process steps
performed by the Director in response to receiving from the
Ping Manager the sequence of information depicted in lines
(5) and (6) above. In a step 512 of flowchart 510, the
Director performs processing on the incoming information
from the Ping Manager to store the information in a usable
format as depicted by the pseudocode in lines (8) herein
below:

($addr, @metrics)=split(' ', $incoming_pingmgr_message); (8)
$ping_cache{$addr}=$incoming_ping_message;
$ping_cache_time{$addr}=&get_current_time();
# Store round-trip ping metrics in useful format.
#
@temp_keys=@incoming_metric_order;
while ($value=shift(@metrics)) {$key=shift(@temp_keys);
$ping{$key}=$value;
}

Next, in a step 514, in response to a request received from
a Front End process, the Director will retrieve information
about that request to be answered. Lines (9) depict
pseudocode for processing step 514:

# Retrieve info about the request that must be answered. (9)
#
$request_frontend=$request_frontend{$addr}; # who gets reply
$contact_site=$request_contact{$addr}; # need for path calcs

In a step 516, the Director calculates one-way metrics
from the two-way metrics provided by the Ping Manager
according to the following pseudocode in lines (10):

# Calculate one-way metrics. (10)
#
$p1way{$contact_site}=$ping{$contact_site}/2;
foreach $border ('ny', 'la', 'dc', 'sj') {next if ($border eq $contact_site);
$p1way{$border}=$ping{$border} -
$p1way{$contact_site} -
$internal{"$contact_site $border"};
}

Next, in a step 518, the Director calculates path metrics
for all site and border combinations according to the
following pseudocode in lines (11):

## Calculate the path metrics for all site/border combinations. (11)
#
foreach $site ('ny', 'la', 'dc', 'sj') {foreach $border ('ny', 'la', 'dc', 'sj') {
# This calculates round-trip from server, through border,
# to client, back through 1st contact border, to server.
# Note that $ping{} array has one-way metric times,
# Also use $internal{} associative array.
#
# We weigh outgoing path components twice as much,
# because we believe it is more important to consider
# outgoing data path.
#
$metric=2 * $p1way{$border} +
2 * $internal{"$site $border"} +
$p1way{$contact_site} +
$internal{"$border $contact_site"};
# save it for easy sorting
#
$path{$metric}=($path{$metric})?
"$path{$metric} $site $border" : "$site $border"; }
}

Next, in a step 520, the Director sorts the metrics and
constructs an ordered list using the following pseudocode in
lines (12):

# Sort metrics and construct required list in order. (12)
#
foreach $metric (sort keys %path) {
@list=split(' ', $path{$metric});
while (@list) {
$site=shift(@list);
$border=shift(@list);
push(@sorted_list, $site, $border, $metric);
}
}

At this point, @sorted_list contains all the
information needed to select pairs of sites and candidate
servers.

4.3 Load Manager

The Load Manager sends messages to the Director about
once every two seconds, in the format depicted in line (13):

"server1 load1 server2 load2 . . . serverN loadN"; (13)

FIG. 5C depicts a flowchart 521 of the process steps
performed by the Director in selecting a "best" server, using
the information received from the Load Manager, depicted in
line (13), and the Ping Manager, depicted in lines (5)
and (6) above.

The Director maintains several internal data structures,
including an associative array of server loads, which pairs load
values with server identifiers:

$load_array{$serverID}=$load;

a list of servers at each site:

$servers{'la'}="www.la1.test.com www.la2.test.com";

$servers{'dc'}="www.dc1.test.com www.dc2.test.com";

a mapping of servers and borders into a correct name that
has a corresponding source address:

$server_map{"www.la1.test.com la"}="www.la1-la.test.com";

$server_map{"www.la1.test.com dc"}="www.la1-dc.test.com";
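
The construction of the sorted list in lines (10) through (12) can be sketched in Python as shown below. This is an illustration under stated assumptions rather than the disclosed implementation: the sample ping values and the uniform internal hop cost of 20 are invented, tuples replace the "site border" string keys, and the sort is numeric. The subtraction used to derive a one-way time for a non-contact border (round trip minus the contact-site leg and the internal hop) is itself an assumption about the intended arithmetic.

```python
SITES = ["ny", "la", "dc", "sj"]


def one_way(ping, internal, contact):
    """Lines (10): derive one-way times from round-trip ping values."""
    p1way = {contact: ping[contact] / 2.0}
    for border in SITES:
        if border == contact:
            continue
        # Assumed: remove the return leg through the contact site and
        # the internal hop between contact site and border.
        p1way[border] = ping[border] - p1way[contact] - internal[(contact, border)]
    return p1way


def sorted_paths(p1way, internal, contact):
    """Lines (11)-(12): weight the outgoing leg twice, then sort by metric."""
    triples = []
    for site in SITES:
        for border in SITES:
            metric = (2 * p1way[border] + 2 * internal[(site, border)]
                      + p1way[contact] + internal[(border, contact)])
            triples.append((site, border, metric))
    return sorted(triples, key=lambda t: t[2])


# Hypothetical inputs: zero cost within a site, 20 between any two sites.
internal = {(a, b): (0 if a == b else 20) for a in SITES for b in SITES}
ping = {"ny": 100, "la": 180, "dc": 150, "sj": 200}
p1way = one_way(ping, internal, "ny")
paths = sorted_paths(p1way, internal, "ny")
print(paths[0])
```

For these inputs the best combination is the NY server leaving through the NY border, since every other combination pays either a slower outgoing leg or an internal hop that is counted twice.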

The variable @sorted_list, generated above, contains
triples of sites, borders, and path metrics. From these data
structures, the Director can choose an optimal server using
the steps shown in FIG. 5C. A decisional step verifies that all
entries in the @sorted_list have not been processed, and
initializes the "$best_metric" variable to the metric of the
first member of the @sorted_list. These steps are depicted
in the following pseudocode:

%site_border=();

@server_list=();

## Top loop chooses next set of candidate paths (and
sites);

## loop is broken when we have selected a server.
#

while (1) {

break unless (@sorted_list);

$best_metric=$sorted_list[2] * $fuzz_factor; # $fuzz_factor=1.05;

As shown in FIG. 5C, a decisional step 522, a process step
524, a decisional step 526, and a process step 528 implement
a looping construct for selecting triples of site, border and
metric from the @sorted_list and adding the border
information to a site border list and the site to a server list. This
is also depicted in the following pseudocode:

while (@sorted_list && ($sorted_list[2]<=$best_metric)) {

$site=shift(@sorted_list);

$border=shift(@sorted_list);

$metric=shift(@sorted_list);

# check if site already has better outgoing border

if (! $site_border{$site}) {$site_border{$site}=$border;
push(@server_list, split(' ', $servers{$site}));

}

}

Once this loop has processed all members of the @sorted_list,
decisional step 522 will take the "yes" path, and
processing will continue with a decisional step 530, which,
along with process steps 532 and 534, implements a looping
construct for processing all servers in the server list
generated above, and adding each server's load to a server load
list, and totaling the loads of all servers in a $total variable.
This is also depicted in the following pseudocode:

@server_load=();

$total=0;

foreach $server (@server_list) {push(@server_load,
$load_array{$server}); $total += $load_array{$server};

}

next unless ($total); # check for all servers busy
Once this loop has processed all members of the @server_list,
decisional step 530 of FIG. 5C will take the "yes" path,
and processing will continue with a process step 536, which
determines a random number between 1 and the total server
load. Next, processing continues in the loop formed by
decisional steps 538 and 544 and processing steps 540, 542,
and 546. This loop steps through the servers in the
@server_list, summing their loads until this running total
exceeds the random number selected in process step 536, or
until the last member of the @server_list has been
processed. In either case, the server being examined when either
of these two conditions is fulfilled is the server selected by
the Director as the "best" server. This is also depicted in the
following pseudocode:

# tricky to handle fencepost values.

# Should return value between 1 and $total (inclusive);
#

$random=roll_dice($total);
$total=0;

while ($the_server=shift(@server_list)) {$value=shift(@server_load);
next unless ($value);
$total += $value;
last if ($random<=$total);
}

last; ## We have answer, break out of while loop . . .
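
The selection loop above amounts to a weighted random draw over the candidate servers. A minimal Python sketch follows. The function name pick_server and the optional roll parameter (added so the draw can be fixed for illustration) are assumptions; random.randint stands in for the roll_dice routine, whose implementation is not given. The sketch assumes integer load values and treats them as selection weights, so a server reporting a larger value is chosen more often.

```python
import random


def pick_server(servers, weights, roll=None):
    """Walk the list, summing weights until the running total reaches the roll."""
    total = sum(weights)
    if total == 0:
        return None  # corresponds to the "all servers busy" check
    if roll is None:
        roll = random.randint(1, total)  # inclusive at both ends
    running = 0
    for server, weight in zip(servers, weights):
        if not weight:
            continue  # a zero value can never be selected
        running += weight
        if roll <= running:
            return server
    return None


servers = ["www.la1.test.com", "www.la2.test.com", "www.dc1.test.com"]
weights = [2, 0, 3]
print(pick_server(servers, weights, roll=3))
```

Because the roll lies between 1 and the total inclusive, the first server is chosen for rolls 1 and 2 and the third for rolls 3 through 5, which is the fencepost behavior the pseudocode comment warns about.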

Once the director has selected a "best" server, processing 
continues as described by step 504 of FIG. 5A, and the 
following pseudo code: 

# Selected server is at which site???
#

$site=$get_site{$the_server};

# Recall information about best border for this site
#

$border=$site_border{$site};

# Get the correct name which will use source address we
# need.

$response_to_request=$server_map{"$the_server $border"};
5.0 Conclusion

In conclusion, it can be seen that the present invention
provides for an internetworked system wherein data objects
are served to users according to the shortest available
network path. A further advantage to the approaches
according to the invention is that these methods exhibit tolerance
of faults occurring in the underlying hardware. Additionally,
reliability of systems according to the invention is
increased over prior art web servers. Other embodiments of
the present invention and its individual components will
become readily apparent to those skilled in the art from the
foregoing detailed description, wherein are described
embodiments of the invention by way of illustrating the best mode
contemplated for carrying out the invention. As will be
realized, the invention is capable of other and different
embodiments and its several details are capable of
modifications in various obvious respects, all without departing
from the spirit and the scope of the present invention.
Accordingly, the drawings and detailed description are to be
regarded as illustrative in nature and not as restrictive. It is
therefore not intended that the invention be limited except as
indicated by the appended claims.
What is claimed is: 

1. A system for routing requests for data objects from a 
plurality of clients comprising: 

a plurality of content servers configured to serve said data 
65 objects; 

a director component for routing said requests for data
objects, said director component further comprising:

a server selection component configured to select a
content server from said plurality of content servers
based upon a policy;

an identification component configured to identify a
plurality of routes from a requesting client to said
content server;

a determination component configured to determine a
time to traverse each of said plurality of routes; and

a route selection component configured to:

select a route from said requesting client to said
content server having the shortest time to traverse;
and

identify one of a plurality of IP addresses associated
with said content server that directs content to the
requesting client, from the content server, along
the selected route.

2. The system of claim 1 wherein said director component
includes

an answer component for answering a request for a data
object with a response that redirects a request from one
of said plurality of clients to said selected content
server along said route.

3. The system of claim 2 wherein said answer component
includes a redirection component that uses an HTTP redirect
response.

4. The system of claim 1 wherein said determination
component includes a measurement component that measures
the time to traverse said route using an ICMP echo
reply.

5. The system of claim 1 wherein said server selection
component includes:

a determination component configured to determine the
number of open TCP connections for each content
server in said plurality of content servers; and

a selection component configured to select a content
server having the least number of open TCP connections.

6. The system of claim 1 wherein said server selection
component includes:

a determination component configured to determine the
amount of available free RAM for each content server
in said plurality of content servers; and

a selection component configured to select a content
server having the largest amount of available free
RAM.

7. The system of claim 1 wherein said server selection
component includes:

a determination component configured to determine the
amount of available free SWAP for each content server
in said plurality of content servers; and

a selection component configured to select a content
server having the largest amount of available free
SWAP.

8. The system of claim 1 wherein said server selection
component includes:

a determination component configured to determine the
amount of CPU idle time for each content server in said
plurality of content servers; and

a selection component capable of selecting the content
server having the highest amount of CPU idle time.

9. A method for routing requests for data objects from a
plurality of clients comprising the steps of:

receiving a request for a data object from one of said
plurality of clients;

providing a plurality of content servers capable of serving
data objects, wherein each of said plurality of content
servers are comprised of a plurality of IP addresses that
each route data objects from said content servers along
a different path;

determining a best server from said plurality of content
servers according to a policy;

identifying a plurality of routes from said client to said
best server;

determining a time to traverse each of said plurality of
routes;

selecting one of said routes having the shortest time to
traverse; and

informing said client of one of said plurality of IP
addresses associated with the best server that directs
content to the client, from the best server, along said
route.

10. The method of claim 9 wherein said informing a
particular client further comprises:

answering a request for a data object with a response that
causes said particular client to redirect said request to
best server.

11. The method of claim 10 wherein said response that
redirects uses an HTTP redirect response.

12. The method of claim 9 wherein determining said best
server according to said policy further comprises selecting a
least loaded content server from said plurality of content
servers.

13. The method of claim 12 wherein determining said best
server further comprises:

determining the number of open TCP connections for
each content server in said plurality of content servers;
and

selecting as the least loaded content server the content
server having the least number of open TCP connections.

14. The method of claim 12 wherein determining a best
server further comprises:

determining the amount of available free RAM for each
content server in said plurality of content servers; and

selecting as the least loaded content server the content
server having the largest amount of available free
RAM.

15. The method of claim 12 wherein determining said best
server further comprises:

determining the amount of available free SWAP for each
content server in said plurality of content servers; and

selecting as the least loaded content server the content
server having the largest amount of available free
SWAP.

16. The method of claim 12 wherein determining said best
server further comprises:

determining the amount of CPU idle time for each content
server in said plurality of content servers; and

selecting as the least loaded content server the content
server having the highest amount of CPU idle time.

17. The method of claim 9 wherein said time to traverse
is determined by measuring the time associated with an
ICMP echo reply made over said route.

18. A method for routing information from a plurality of
clients comprising the steps of:

receiving a request for a data object from one of a
plurality of clients;

determining a best server from said plurality of content
servers according to a policy associated with a packet
networking environment;

determining one of a plurality of routes from said requesting
client to said best server, said route having the
shortest time to traverse; and

relaying said request back to said client, wherein said
request is comprised of one of a plurality of IP
addresses associated with said best server that directs
content corresponding to the client's request to the client
along said route.

19. The method of claim 18 wherein said relaying information
from a particular client further comprises:

forwarding said information to said best server using 
network layer protocols. 

20. The method of claim 18 wherein determining a best 
server further comprises selecting a least loaded content 
server from said plurality of content servers.

21. The method of claim 18 wherein said determining a 
route further comprises: 

identifying a plurality of routes from said requesting 
client to said best server;

determining a time to traverse each of said plurality of 
routes; and 

selecting one of said plurality of routes having the shortest 
time to traverse said route. 

22. The method of claim 21 wherein said time to traverse
is determined by an ICMP echo reply. 
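
The route selection recited in claims 21 and 22 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the `Route` record, the `measure` callback, and the sample addresses and timings are hypothetical. The callback stands in for timing an ICMP echo reply over each route; sending real ICMP packets is left out of the sketch.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Route:
    """Hypothetical record: one path to the best server, identified by
    the server IP address whose traffic follows that path."""
    server_ip: str


def select_fastest_route(
    routes: Iterable[Route],
    measure: Callable[[Route], Optional[float]],
) -> Optional[Route]:
    """Return the route with the shortest measured traversal time.

    `measure` returns a time in seconds, or None when no echo reply
    arrived. Routes without a reply are skipped; None is returned if
    no route produced a measurement.
    """
    best, best_time = None, None
    for route in routes:
        elapsed = measure(route)
        if elapsed is None:
            continue
        if best_time is None or elapsed < best_time:
            best, best_time = route, elapsed
    return best


# Hypothetical timings (seconds) standing in for ICMP echo measurements.
sample_times = {"192.0.2.10": 0.120, "192.0.2.11": 0.045, "192.0.2.12": None}
routes = [Route(ip) for ip in sample_times]
chosen = select_fastest_route(routes, lambda r: sample_times[r.server_ip])
print(chosen.server_ip)  # 192.0.2.11
```

The selected route's IP address is the one that would be returned to the client, so that content follows the fastest measured path.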

23. The method of claim 18 wherein determining a best 
server further comprises: 

determining the number of open TCP connections for 
each content server in said plurality of content servers;
and 

selecting as the best server the content server having the 
least number of open TCP connections. 

24. The method of claim 18 wherein determining a best 
server further comprises:

determining the amount of available free RAM for each 
content server in said plurality of content servers; and 

selecting as the best server the content server having the 
largest amount of available free RAM.

25. The method of claim 18 wherein determining a best 
server further comprises: 

determining the amount of available free SWAP for each 
content server in said plurality of content servers; and 

selecting as the best server the content server having the
largest amount of available free SWAP.

26. The method of claim 18 wherein determining a best
server further comprises: 

determining the amount of CPU idle time for each content 
server in said plurality of content servers; and 

selecting as the best server the content server having the 
highest amount of CPU idle time. 
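
Claims 23 through 26 each pick the best server by a single load metric. A minimal sketch of that selection, assuming a hypothetical `ServerStats` record and hypothetical policy names (`tcp`, `ram`, `swap`, `cpu`) that do not appear in the claims:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ServerStats:
    """Hypothetical snapshot of the load metrics named in the claims."""
    name: str
    open_tcp_connections: int
    free_ram_bytes: int
    free_swap_bytes: int
    cpu_idle_fraction: float


# Each policy maps a server to a sort key; the smallest key wins.
# "Largest" metrics are negated so that min() selects the largest value.
POLICIES: Dict[str, Callable[[ServerStats], float]] = {
    "tcp": lambda s: s.open_tcp_connections,  # fewest open TCP connections
    "ram": lambda s: -s.free_ram_bytes,       # most free RAM
    "swap": lambda s: -s.free_swap_bytes,     # most free SWAP
    "cpu": lambda s: -s.cpu_idle_fraction,    # most CPU idle time
}


def select_best_server(servers: List[ServerStats], policy: str) -> ServerStats:
    """Return the best server under the named policy (first wins on ties)."""
    if not servers:
        raise ValueError("no content servers available")
    return min(servers, key=POLICIES[policy])


# Hypothetical sample data.
servers = [
    ServerStats("a", 40, 2_000_000_000, 1_000_000_000, 0.30),
    ServerStats("b", 12, 500_000_000, 4_000_000_000, 0.10),
    ServerStats("c", 25, 8_000_000_000, 2_000_000_000, 0.75),
]
print({p: select_best_server(servers, p).name for p in POLICIES})
# {'tcp': 'b', 'ram': 'c', 'swap': 'b', 'cpu': 'c'}
```

Keeping each policy as a separate key function mirrors the claims, which recite the metrics as alternatives rather than as a combined score.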

27. A system for routing requests for data objects from a 
plurality of clients comprising: 

plurality of content servers that serve said data objects, 
each of said plurality of content servers comprising a 
plurality of IP addresses; and 

a director component that identifies a least loaded server 
based upon a policy that: 

selects said least loaded server from one of said plu- 
rality of servers; 

selects one of a plurality of routes from a requesting 
client to said least loaded server, wherein said route 
has the shortest time to traverse; and 

identifies an IP address associated with the least loaded 
server that directs content to the client along said 
route. 

28. The system of claim 27 wherein said director com- 
ponent further includes: 

a selection component capable of selecting a server from 
said plurality of servers based upon said policy; and 

an answer component capable of answering a request for 
a data object by indicating to one of said plurality of 
clients to redirect the request to said server selected by 
said selection component. 

29. A method of routing requests for data objects from a
plurality of clients comprising the steps of:

providing a plurality of content servers that serve said 
data objects; and 

routing said requests for data objects back to a request- 
ing client for subsequent transmission to one of said 
plurality of servers, wherein each of said requests is 
comprised of one of a plurality of IP addresses 
associated with said content server that directs con- 
tent corresponding to the client's request from said 
content server along a route having the shortest time 
to traverse. 

*****

FIG. 1b

FIG. 1c

FIG. 3

[Flow chart; reference numerals 400, 402, 404, 408, 416, 422, 426,
428. Recoverable box text: Client content request arrives; Query CSD
for available servers (FIG. 4); Redirect flow to remote server; Spoof
TCP connection; Reject flow with appropriate error; Ask FAC to assign
flow to local server (FIG. 17); Set up network address translation;
'Connect' to appropriate server; Forward content request to server as
appropriate; Done.]



06/10/2003, EAST Version: 1.03.0002 



U.S. Patent Sep. 10, 2002 Sheet 6 of 26 US 6,449,647 B1



FIG. 4

[Flow chart; reference numerals 404, 429, 430, 432, 434, 440, 442,
444, 446, 452, 453. Recoverable box text: Parse content request; Query
database for servers serving requested content; Ask ICP/IPP to locate
requested content; Query CCD for available client information; Create
new record using default data; Ask ICP to get better data for future
requests; Assign value to Status according to FIG. 5; Status = REJECT;
Status = SPOOF; Done.]



FIG. 5 (Sheet 7 of 26)

[Flow chart; reference numerals 442, 454, 456, 458, 460, 462, 464,
466. Recoverable box text: Attempt to locate client in 'stuck' pool;
List contains only one server; Status = ACCEPT; Status = REJECT; Done.]



FIG. 6 (Sheet 8 of 26)

[Flow chart; reference numerals 456, 468, 470, 472, 476, 478, 480.
Recoverable box text: Evaluate requested content (FIG. 7); Filter
candidate server list and order remaining candidate servers (FIG. 8);
Assign proximity preferences to remaining servers (FIG. 22); Status =
ACCEPT; Status = REDIRECT; Done.]



FIG. 7 (Sheet 9 of 26)

[Flow chart; reference numerals 484, 486, 492, 494. Recoverable box
text: Recalculate avgInterval; requestFlag = BURSTY; burstLength = 0;
requestFlag = 0.]




FIG. 8 (Sheet 10 of 26)

[Flow chart; reference numerals 514, 518, 520, 522, 528. Recoverable
box text: Select first server in list; Select next server in list;
Evaluate server (FIG. 10); Remove server from candidate server list;
Apply ordering rules (FIG. 11).]



FIG. 10 (Sheet 12 of 26)

[Flow chart; reference numerals 540, 570. No box text survives.]




FIG. 11 (Sheet 13 of 26)

[Flow chart; reference numerals 572, 574, 578, 582, 586, 588, 590,
594. Recoverable box text: Rule 1, Status = FIG. 12; Rule 2, Status =
FIG. 13; Rule 3, Status = FIG. 14; Rule 4, Status = FIG. 15; Rule 5,
Status = FIG. 16; Status = OKAY?; Server is not optimal; Done.]



FIG. 12 (Sheet 14 of 26)

[Flow chart; reference numerals 604, 606. Recoverable box text:
Status = OKAY; Status = NOT OPTIMAL; Done.]



FIG. 13 (Sheet 15 of 26)

[Flow chart; reference numerals 608, 614, 616, 618. Recoverable box
text: Status = NOT OPTIMAL; Status = OKAY; Done.]



FIG. 14 (Sheet 16 of 26)

[Flow chart; reference numerals 620, 622, 624. Recoverable box text:
Status = OKAY; Done.]



FIG. 15 (Sheet 17 of 26)

[Flow chart; reference numerals 646, 648. Recoverable box text:
Status = NOT OPTIMAL; Done.]









FIG. 18 (Sheet 20 of 26)

[Flow chart; reference numerals 690, 726, 728, 734, 736. Recoverable
box text: Determine ingress and egress ports; Construct QoS tags
(FIG. 19); Assign flow to flow pipe; Assign flow to VC pipe; Done.]



FIG. 19 (Sheet 21 of 26)

[Flow chart; reference numerals 728, 740, 742, 746, 754, 760.
Recoverable box text: Is content streamed?; Find minimum bandwidth
MinBW; Find average bandwidth AvgBW; Find burst tolerance btol; Find
peak bandwidth PeakBW; Find buffer requirements Buffers; Construct QoS
Tag from Buffers and MinBW; Find all similar QoS tags (FIG. 20); Done.]



FIG. 20 (Sheet 22 of 26)

[Flow chart; reference numerals 754, 764, 766, 768, 772. Recoverable
box text: Find all existing QoS tags with higher MinBW and lower
Buffers; Find all existing QoS tags with lower MinBW and higher
Buffers; For each QoS tag: calculate average bandwidth, calculate new
TCP window size, verify TCP window is at least 4K; For each QoS tag,
verify that Buffers = (PeakBW - MinBW) * btol; Done.]
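
The two checks named in FIG. 20, Buffers = (PeakBW - MinBW) * btol and a TCP window of at least 4K, can be sketched as below. This is an illustrative sketch only. The figure text gives no units, so the code assumes bandwidths in bytes per second, burst tolerance in seconds, and "4K" meaning 4096 bytes; all function names are hypothetical. The figure does not say how the new TCP window size is calculated, so that step is not modeled.

```python
import math

MIN_TCP_WINDOW_BYTES = 4 * 1024  # assumption: "4K" means 4096 bytes


def required_buffers(peak_bw: float, min_bw: float, btol: float) -> float:
    """Buffer requirement implied by FIG. 20: (PeakBW - MinBW) * btol."""
    if peak_bw < min_bw:
        raise ValueError("peak bandwidth must not be below minimum bandwidth")
    return (peak_bw - min_bw) * btol


def buffers_consistent(buffers: float, peak_bw: float, min_bw: float,
                       btol: float, rel_tol: float = 1e-9) -> bool:
    """Check that a QoS tag's Buffers value matches the formula."""
    return math.isclose(buffers, required_buffers(peak_bw, min_bw, btol),
                        rel_tol=rel_tol)


def window_ok(window_bytes: int) -> bool:
    """Check that a TCP window is at least the assumed 4K minimum."""
    return window_bytes >= MIN_TCP_WINDOW_BYTES


# Hypothetical tag: 1.5 MB/s peak, 0.5 MB/s minimum, 0.25 s burst tolerance.
print(required_buffers(1_500_000, 500_000, 0.25))  # 250000.0
```

With these sample values the difference between peak and minimum bandwidth is 1,000,000 bytes per second, so a quarter-second burst needs 250,000 bytes of buffering.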



FIG. 21a (Sheet 23 of 26)

[Block diagram; reference numerals 100a, 100d, 100e, 110, 782b, 782c,
784a. Recoverable labels: Web Server; Content-Aware Flow Switch.]



FIG. 22 (Sheet 25 of 26)

[Flow chart; reference numerals 800, 806, 808, 812, 814, 816, 818,
820, 822, 870. Recoverable box text: Don't assign proximity preference
to any server; Prune list of servers to those in same continent or
unknown; Assign proximity preference to server in same continent as
client; More than one server in same continent as client?; Look up
client AS; Look up server AS's; Identify servers with AS's assigned to
same ISP as any client AS; Assign proximity preferences to identified
servers; Verify path between client and preferred server(s); Place
preferred servers at top of candidate server list.]
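
The proximity ordering outlined in FIG. 22 can be sketched roughly as follows. The branch structure of the flow chart is not recoverable from the surviving text, so the order of the steps here is an assumption, as are the `Node` record, the field names, and the sample data. The path-verification step is omitted, and "no preference" is assumed to apply when the client's continent is unknown.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical client or server record."""
    name: str
    continent: Optional[str]            # None when the continent is unknown
    isps: FrozenSet[str] = frozenset()  # ISPs that own the node's AS numbers


def order_by_proximity(client: Node, candidates: List[Node]) -> List[Node]:
    """Return candidates with proximity-preferred servers placed first."""
    if client.continent is None:
        return list(candidates)  # assumed case: no preference is assigned

    # Prune to servers in the client's continent or of unknown continent.
    kept = [s for s in candidates if s.continent in (client.continent, None)]
    same = [s for s in kept if s.continent == client.continent]

    if len(same) <= 1:
        preferred = same  # a single same-continent server is preferred
    else:
        # Several candidates: prefer those sharing an ISP with the client.
        preferred = [s for s in same if s.isps & client.isps]

    rest = [s for s in kept if s not in preferred]
    return preferred + rest


# Hypothetical sample data.
client = Node("c", "EU", frozenset({"isp-a"}))
servers = [
    Node("s1", "NA"),
    Node("s2", "EU", frozenset({"isp-b"})),
    Node("s3", "EU", frozenset({"isp-a"})),
    Node("s4", None),
]
print([s.name for s in order_by_proximity(client, servers)])
# ['s3', 's2', 's4']
```

Server s1 is pruned because it is on another continent, and s3 moves to the top because it shares an ISP with the client.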



FIG. 23 (Sheet 26 of 26)

[Block diagram; reference numerals 1010, 1020, 1030, 1040, 1050, 1080,
1090, 1100, 1110, 1120. Recoverable labels: Removable Disk Drive; Disk
Drive Controller; I/O Interface; Mass Storage Interface; I/O Bus;
Processor; I/O Controller; CPU Bus; RAM; ROM.]



US 6,449,647 B1

CONTENT-AWARE SWITCHING OF
NETWORK PACKETS

This application is a Continuation of U.S. application
Ser. No. 09/050,524, Filed Mar. 30, 1998, now issued U.S.
Pat. No. 6,006,264, which claims priority from U.S. Provisional
Application Ser. No. 60/054,687, Filed Aug. 1, 1997.

REFERENCES TO RELATED APPLICATIONS

This application claims priority from a provisional application
Ser. No. 60/054,687, filed Aug. 1, 1997, which is
hereby incorporated by reference.

BACKGROUND OF THE INVENTION

The present invention relates to content-based flow
switching in Internet Protocol (IP) networks.

IP networks route packets based on network address
information that is embedded in the headers of packets. In
the most general sense, the architecture of a typical data
switch consists of four primary components: (1) a number of
physical network ports (both ingress ports and egress ports),
(2) a data plane, (3) a control plane, and (4) a management
plane. The data plane, sometimes referred to as the
"fastpath," is responsible for moving packets from ingress
ports of the data switch to egress ports of the data switch
based on addressing information contained in the packet
headers and information from the data switch's forwarding
table. The forwarding table contains a mapping between all
the network addresses the data switch has previously seen
and the physical port on which packets destined for that
address should be sent. Packets that have not previously
been mapped to a physical port are directed to the control
plane. The control plane determines the physical port to
which the packet should be forwarded. The control plane is
also responsible for updating the forwarding table so that
future packets to the same destination may be forwarded
directly by the data plane. The data plane functionality is
commonly performed in hardware. The management plane
performs administrative functions such as providing a user
interface (UI) and managing Simple Network Management
Protocol (SNMP) engines.

Packets conforming to the TCP/IP Internet layering model
have 5 layers of headers containing network address
information, arranged in increasing order of abstraction. A
data switch is categorized as a layer N switch if it makes
switching decisions based on address information in the Nth
layer of a packet header. For example, both Local Area
Network (LAN, layer 2) switching and IP (layer 3) switching
switch packets based solely on address information contained
in transmitted packet headers. In the case of LAN
switching, the destination MAC address is used for
switching, and in the case of IP switching, the destination IP
address is used for switching.

Applications that communicate over the Internet typically
communicate with each other over a transport layer (layer 4)
Transmission Control Protocol (TCP) or User Datagram
Protocol (UDP) connection. Such applications need not be
aware of the switching that occurs at lower levels (levels
1-3) to support the layer 4 connection. For example, an
HyperText Transfer Protocol (HTTP) client (also known as
a web browser) exchanges HTTP (layer 5) control messages
and data (payload) with a target web server over a TCP
(layer 4) connection.

"Content" can be loosely defined as any information that
a client application is interested in receiving. In an IP
network, this information is typically delivered by an
application-layer server application using TCP or UDP as its
transport layer. The content itself may be, for example, a
simple ASCII text file, a binary file, an HTML page, a Java
applet, or real-time audio or video.

A "flow" is a series of frames exchanged between two
connection endpoints defined by a layer 3 network address
and a layer 4 port number pair for each end of the connection.
Typically, a flow is initiated by a request at one of the
connection endpoints for content which is accessible
through the other connection endpoint. The flow that is
created in response to the request consists of (1) packets
containing the requested content, and (2) control messages
exchanged between the two endpoints.

Flow classification techniques are used to associate priority
codes with flows based on their Quality of Service
(QoS) requirements. Such techniques prioritize network
requests by treating flows with different QoS classes differently
when the flows compete for limited network resources.
Flows in the same QoS class are assigned the same priority
code. A flow classification technique may, for example,
classify flows based on IP addresses and other inner protocol
header fields. For example, a QoS class with a particular
priority may consist of all flows that are destined for
destination IP address 142.192.7.7 and TCP port number 80
and TOS of 1 (Type of Service field in the IP header). This
technique can be used to improve QoS by giving higher
priority flows better treatment.

Internet Service Providers (ISPs) and other Internet Content
Providers commonly maintain web sites for their customers.
This service is called web hosting. Each web site is
associated with a web host. A web host may be a physical
web server. A web host may also be a logical entity, referred
to as a virtual web host (VWH). A virtual web host associated
with a large web site may span multiple physical web
servers. Conversely, several virtual web hosts associated
with small web sites may share a single physical web server.
In either case, each virtual web host provides the functionality
of a single physical web server in a way that is
transparent to the client. The web sites hosted on a virtual
web host share server resources, such as CPU cycles and
memory, but are provided with all of the services of a
dedicated web server. A virtual web host has one or more
public virtual IP address that clients use to access content on
the virtual web host. A web host is uniquely identified by its
public IP address. When a content request is made to the
virtual web host's virtual IP address, the virtual IP address
is mapped to a private IP address, which points either to a
physical server or to a software application identified by
both a private IP address and a layer 4 port number that is
allocated to the application.

SUMMARY OF THE INVENTION

In one aspect, the invention features content-aware flow
switching in an IP network. Specifically, when a client in an
IP network makes a content request, the request is intercepted
by a content-aware flow switch, which seamlessly
forwards the content request to a server that is well-suited to
serve the content request. The server is chosen by the flow
switch based on the type of content requested, the QoS
requirements implied by the content request, the degree of
load on available servers, network congestion information,
and the proximity of the client to available servers. The
entire process of server selection is transparent to the client.

In another aspect, the invention features implicit deduction
of the QoS requirements of a flow based on the content
of the flow request. After a flow is detected, a QoS category
is associated with the flow, and buffer and bandwidth
resources consistent with the QoS category of the flow are
allocated. Implicit deduction of the QoS requirements of
incoming flow requests allows network applications to significantly
improve their Quality of Service (QoS) behavior
by (1) preventing over-allocation of system resources, and
(2) enforcing fair competition among flows for limited
system resources based on their QoS classes by using a strict
priority and weighted fair queuing algorithm.

In another aspect, the invention features flow pipes, which
are logical pipes through which all flows between virtual
web hosts and clients travel. A single content-aware flow
switch can support multiple flow pipes. A configurable
percentage of the bandwidth of a content-aware flow switch
is reserved for each flow pipe.

In another aspect, the invention features a method for
selecting a best-fit server, from among a plurality of servers,
to service a client request for content in an IP network. A
location of the client is identified. A location of each of the
plurality of servers is identified. Servers that are in the same
location as the client are identified. A server from among the
plurality of servers is selected as the best-fit server, using a
method which assigns a proximity preference to the identified
servers. The location of the client may be a continent in
which the client resides. The location of each of the plurality
of servers may be a continent in which the server resides.
Servers that are in the same location as the client may be
identified by identifying administrative authorities associated
with the client based on its IP address, identifying, for
each of the plurality of servers, administrative authorities
associated with the server, and identifying servers associated
with an administrative authority that is associated with the
client. The administrative authorities may be Internet Service
Providers.

One advantage of the invention is that content-aware flow
switches can be interconnected and overlaid on top of an IP
network to provide content-aware flow switching regardless
of the underlying technology used by the IP network. In this
way, the invention provides content-aware flow switching
without requiring modifications to the core of existing IP
networks.

Another advantage of the invention is that by using
content-aware flow switching, a server farm may gracefully
absorb a content request spike beyond the capacity of the
farm by directing content requests to other servers. This
allows mirroring of critical content in distributed data
centers, with overflow content delivery capacity and backup
in the case of a partial communications failure. Content-aware
flow switches also allow individual web servers to be
transparently removed for service.

Another advantage of the invention is that it performs
admission control on a per flow basis, based on the level of
local network congestion, the system resources available on
the content-aware flow switch, and the resources available
on the web servers front-ended by the flow switch. This
allows resources to be allocated in accordance with individual
flow QoS requirements.

One advantage of flow pipes is that the virtual web host
associated with a flow pipe is guaranteed a certain percentage
of the total bandwidth available to the flow switch,
regardless of the other activity in the flow switch. Another
advantage of flow pipes is that the quality of service provided
to the flows in a flow pipe is tailored to the QoS
requirements implied by the content of the individual flows.

Another advantage of the invention is that, when performing
server selection, a server in the same continent as the
client is preferred over servers in another continent.
Trans-continental network links introduce delay and are frequently
congested. The server selection process tends to avoid such
trans-continental links and the bottlenecks they introduce.

Another advantage of the invention is that, when performing
server selection, a server that shares a "closest" backbone
ISP with the client is preferred. Backbone ISPs connect
with one another at Network Access Points (NAP). NAPs
frequently experience congestion. By selecting a path
between a client and a server that does not include a NAP,
bottlenecks are avoided.

Other features and advantages of the invention will
become apparent from the following description and from
the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1a is a block diagram of an IP network.

FIG. 1b is a block diagram of a segment of a network
employing a content-aware flow switch.

FIG. 1c is a block diagram of traffic flow through a
content-aware flow switch.

FIG. 2 is a block diagram illustrating operations performed
by and communications among components of a
content-aware flow switch during flow setup.

FIG. 3 is a flow chart of a method for servicing a content
request using a content-aware flow switch.

FIG. 4 is a flow chart of a method for parsing a flow setup
request.

FIGS. 5 and 6 are flow charts of methods for sorting a list
of candidate servers.

FIG. 7 is a flow chart of a method for evaluating requested
content.

FIG. 8 is a flow chart of a method for sorting a list of
candidate servers.

FIG. 9 is a flow chart of a method for filtering from
a list of candidate servers.

FIG. 10 is a flow chart of a method for evaluating a server
in a list of candidate servers.

FIG. 11 is a flow chart of a method for ordering a
server in a list of candidate servers.

FIGS. 12-16 are flow charts of methods for assigning a
status to a server for purposes of ordering the server in a list
of candidate servers.

FIG. 17 is a flow chart of a method for assigning a flow
to a local server.

FIG. 18 is a flow chart of a method for attempting to
satisfy a request for a flow.

FIG. 19 is a flow chart of a method for constructing a QoS
tag.

FIG. 20 is a flow chart of a method for locating QoS
tags which are similar to a given QoS tag.

FIGS. 21a-b are block diagrams of flow pipe traffic
through a content-aware flow switch.

FIG. 22 is a flow chart of a method for ordering servers
in a list of candidate servers based on proximity.

FIG. 23 is a block diagram of a computer and computer
elements suitable for implementing elements of the invention.

DETAILED DESCRIPTION

Referring to FIG. 1a, in a conventional IP network 100,
such as the Internet, servers are connected to routers at the
edges of the network 100. Each router is connected to one
or more other routers. Each stream of information transmitted
from one end station to another is broken into packets
containing, among other things, a destination address indicating
the end station to which the packet should be delivered.
A packet is transmitted from one end station to another
via a sequence of routers. For example, a packet may
originate at server S1, traverse routers R1, R2, R3, and R4,
and then be delivered to server S2.

In FIG. 1a, a network node is either a router or an end
station. Each router has access to information about each of
the nodes to which the router is connected. When a router
receives a packet, the router examines the packet's destination
address, and forwards the packet to a node that the
router calculates to be most likely to bring the packet closer
to its destination address. The process of choosing an
intermediary destination for a packet and forwarding the
packet to the intermediary destination is called routing.

For example, referring to FIG. 1a, server S1 transmits a
packet, whose destination address is server S2, to router R1.
Router R1 is only connected to server S1 and to router R2.
Router R1 therefore forwards the packet to router R2. When
the packet reaches router R2, router R2 must choose to
forward the packet to one of routers R1, R5, R3, and R6
based on the packet's destination IP address. The packet is
passed from router to router until it reaches its destination of
server S2.

Referring to FIG. 1b, web servers 100a-c and 120a-b are
connected to a content-aware flow switch 110. The web
servers 100a-c are connected to the flow switch 110 over
LAN links 105a-c. The web servers 120a-b are connected
to the flow switch 110 over WAN links 121a-b. The flow
switch 110 may be configured and its health monitored using
a network management station 125. The role of the management
station 125 is to control and manage one or more
communications devices from an external device such as a

The flow switch 110 is connected to an internet through
uplinks 155a-c. When a client content request is accepted by
the flow switch 110, the flow switch 110 establishes a
full-duplex logical connection between the client and one of
the web servers 100a-c through the flow switch 110. Individual
flows are aggregated into pipes, as described in more
detail below. Request traffic flows from the client toward the
server, and response traffic flows to the client.

A component of the flow switch 110, referred to as the Flow
Admission Control (FAC), polices if and how flows are
admitted to the flow switch 110, as described in more detail
below.

The content-aware flow switch 110 differs from typical
layer 2 and layer 3 switches in several respects. First, the
data plane of layer 2 and layer 3 switches forwards packets
based on the destination addresses in the packet headers (the
MAC address and header information in the case of a layer
2 switch and the destination IP address in the case of a layer
3 switch). The content-aware flow switch 110 switches
packets based on a combination of source and destination IP
addresses, transport layer protocol and transport layer
source and destination port numbers. Furthermore, the functions
performed in the control plane of typical layer 2 and
layer 3 switches are based on examination of the layer 2 and
layer 3 headers, respectively, and on well-known bridging
and routing protocols. The control plane of the content-aware
flow switch 110 also performs these functions, but
additionally derives the forwarding path from information
contained in the packet headers up to and including layer 5.
In addition, content-induced QoS and bandwidth
requirements, server loading and network path optimization
are also considered by the content-aware flow switch 110
when selecting the most optimal path for a packet, as
described in more detail below.

FIG. 2 is a block diagram illustrating, at a high level,
operations performed by and communications among components
of the content-aware flow switch 110 during flow
setup. An arrow between two components in FIG. 2 indi-
workstation running network management applications. The ca tes that communication occurs in the direction of the 
network management station 125 communicates with net- arrow between the two components connected by the arrow, 
work devices via a network management protocol such as ^ Referring to FIG. 2, the content-aware flow switch 110 
the Simple Network Management Protocol (SNMP). The mc i ude s: a Web Flow Redirector (WFR), an Intelligent 
flow switch 110 may connect to the network 100 (FIG. la) Gutoa*. (ICP)> a Content Server Database (C SD), a 
through a router 130. The flow switch 110 is connected to the Capability Database (CCD), a How Admission Coo- 
router 130 by a LAN or WAN link 132. Alternatively, the trol (FAC)> aQ Intcmet p robc Prot0 col (IPP), and an Internet 
flow switch 110 may connect to the network 100 direcdy via 45 Proximity Assist (IPA) 

one or more WAN links (not shown). T^e router 130 ^ CSD raainlains ' scvcra i databases containing infor- 

connects to an Internet Service Provider (ISP) (not shown) matkm about COQtent flow characteristics> ^1 locality, 

by multiple WAN links 135a-c. and ^ location of ^ thc bad Qn ^ m 

Referring to FIG. lc, a content-aware flow switch "front- 100a-c and 120a-b. One database maintained by the CSD 
ends" (i.e., intercepts all packets received from and trans- 50 contains content rules, which are defined by the system 
mitted by) a set of local web servers WOa-c, constituting a administrator and which indicate how the flow switch 110 
web server farm 150. Although connections to the web should handle requests for content. Another database main- 
servers lOOa-c are typically initiated by clients on the client tained by the CSD contains content records which are 
side, most of the traffic between a client and the server farm derived from the content rules. Content records contain 
150 is from the servers 100fl-c to the client (the response 55 information related to particular content, such as its associ- 
trafiBc). It is this response traffic that needs to be most ate d IP address, URL, protocol, layer 4 port number, QoS 
carefully controlled by the flow switch 110, indicators, and the load balance algorithm to use when 

The flow switch 110 has a number of physical ingress accessing the content. A content record for particular content 

ports 170a-c and physical egress ports 165**-c. Each of the also points to server records identifying servers containing 

physical ingress ports 170a-c may act as one or more logical 60 the particular content. Another database maintained by the 

ingress ports, and each of the physical egress ports 165<z-c CSD contains server records, each of which contains infor- 

may act as one or more logical egress ports in the procedures mation about a particular server. The server record for a 

described below. Each of the web servers lOOo-c is network server contains, for example, the server's IP address, 

accessible to the content-aware flow switch 110 via one or protocol, a port of the server through which the server can 

more of the physical egress ports 165o-c. Associated with 65 be accessed by the flow switch U0, an indication of whether 

each flow controlled by the flow switch 110 is a logical the server is local or remote with respect to the flow switch 

ingress port and a logical egress port. 110, and load metrics indicating the load on the server. 
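
The content records and server records described above can be pictured as two small linked data structures. The following Python sketch is illustrative only: the class names, field names, and the single numeric load metric are assumptions introduced here, not the layout of the actual CSD.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ServerRecord:
    """One server known to the CSD (hypothetical field names)."""
    ip_address: str
    protocol: str
    port: int            # port through which the flow switch reaches the server
    is_local: bool       # local or remote with respect to the flow switch
    load: float = 0.0    # load metric; a single number is an assumption


@dataclass
class ContentRecord:
    """One piece of content, derived from an administrator's content rule."""
    ip_address: str
    url: str
    protocol: str
    l4_port: int
    qos_class: int                 # stand-in for the "QoS indicators"
    load_balance_algorithm: str
    servers: List[ServerRecord] = field(default_factory=list)

    def local_servers(self) -> List[ServerRecord]:
        """Server records for servers attached locally to the flow switch."""
        return [s for s in self.servers if s.is_local]
```

A content record points to the server records for servers holding the content, so a lookup by content yields its candidate servers directly.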
Information in the CSD is periodically updated from various sources,
as described in more detail below. The WFR, CSD, and FAC are
responsible for selecting a server to service a content request
based on a variety of criteria. The FAC uses server-specific and
content-specific information together with client information and
QoS requirements to determine whether to admit a flow to the flow
switch 110. The ICP is a lightweight HTTP client whose job is to
populate the CSD with server and content information by probing
servers for specific content that is not found in the CSD during a
flow setup. The ICP probes servers for several reasons, including:
(1) to locate specific content that is not already stored in the
CSD, (2) to determine the characteristics of known content such as
its size, (3) to determine relationships between different pieces of
content, and (4) to monitor the health of the servers. ICPs on
various flow switches communicate with each other using the IPP,
which periodically sends local server load and content information
to neighboring content-aware flow switches. The CCD contains
information related to the known capabilities of clients and is
populated by sampling specific flows in progress. The IPA
periodically updates the CSD on the internet proximity of servers
and clients.

A flow setup request may take the form of a TCP SYN from a client
being forwarded to the WFR (202). The WFR passes the flow setup
request to the CSD (204). The CSD determines which servers, if any,
are available to service the flow request and generates a list of
such candidate servers (206). This list of candidate servers is
ordered based on configurable CSD preferences. The individual items
within this list contain all the information the FAC will ultimately
need to make flow admission decisions.

If more than one server exists in the server farm 150 and content is
not fully replicated among the servers in the server farm, then it
may not be possible for the CSD to identify any candidate servers
based upon the receipt of the TCP SYN alone. In this case, the CSD
returns a NULL candidate server list to the WFR with a status
indicator requesting that the TCP connection is to be spoofed and
that the subsequent HTTP GET is to be forwarded to the CSD (212).

If the CSD contains no content records for servers that can satisfy
the received TCP SYN or HTTP GET, a NULL list is returned to the WFR
with a status indicator indicating that the flow request should be
rejected (212). If the CSD finds a content record that satisfies the
HTTP GET but does not find a record for the specific piece of
content requested, a new content record is created containing
default values for the specific piece of content requested. The new
record is then returned to the WFR (212). In either of these two
cases (i.e., the CSD finds no matching records, or the CSD finds a
matching record that does not exactly match the requested content),
the CSD asks the ICP to probe the local servers (using HTTP "HEAD"
operations) to determine where the content is located and to deduce
the content's QoS attributes (208).

The CSD then asks the CCD for information related to the client
making the request (211). The CCD returns any such information in
the CCD to the CSD (210). The CSD returns an ordered list of
candidate servers and any client information obtained from the CCD
to the WFR (212).

Depending on the response returned from the CSD, the WFR will
either: (1) reject, TCP spoof, or redirect the flow as appropriate
(214), or (2) forward the flow request, the list of candidate
servers, and any client information to the FAC for selection and
local setup (216). The FAC evaluates the list of servers contained
in the content record, in the order specified by the CSD, and looks
for a server that can accept the flow (218). The FAC's primary
consideration in selecting a server from the list of candidate
servers is that sufficient port and switch resources be available on
the content-aware flow switch to support the flow. An accepted flow
is assigned either to a VC-pipe or to a flow pipe, as appropriate.
(VC-pipes and flow pipes are described in more detail below.) The
FAC also adjusts flow weights as necessary to maintain flow pipe
bandwidth.

The FAC informs the WFR of which local server, if any, was chosen to
accept the flow, and provides information to the WFR indicating to
which specific VC-pipe or flow pipe the flow was assigned (220). The
WFR sets up the required network address translations for locally
accepted flows so that future packets within the flow can be
modified appropriately (222). If the chosen server is "remote" (not
in the local server farm) (220), an HTTP redirect is generated (222)
that causes the client to go to the chosen remote site for service.

In addition to the steps described above, which are performed as
part of flow setup, the components shown in FIG. 2 perform other
functions, including the following. Periodically, the ICP probes the
servers 100a-c front-ended by the content-aware flow switch 110 for
information regarding server status and content. This activity may
be undertaken proactively (such as polling for general server
health) or at the request of the CSD. The ICP updates the CSD with
the results of this search so that future requests for the same
content will receive better service (224).

The IPP periodically sends local server load and content information
to neighboring content-aware flow switches. Data arriving from these
peers is evaluated and appropriate updates are sent to the CSD
(226). The IPA periodically updates the CSD with internet proximity
information (228).

The operation of the components shown in FIG. 2 is now described in
more detail.

Referring to FIG. 3, the WFR services a client content request as
follows. When a client sends a content request to a server in the
form of a TCP SYN or HTTP GET, the content request is intercepted by
the content-aware flow switch 110, which interprets the request as a
request to set up a flow between the client and an appropriate
server (step 402). The CSD is queried for a list of available
servers to service the content request (step 404). The CSD returns a
list of candidate servers and the status indicator ACCEPT if the
preferred server is known to be in the local server farm. If the CSD
returns a status indicator ACCEPT (decision step 406), then the
content request may be served at one of the local servers 100a-c
front-ended by the flow switch 110. In this case, the FAC is asked
to assign a flow for servicing the content request to a local server
chosen from among the list of candidate servers returned by the CSD
(step 408). If the FAC successfully assigns the flow to a local
server (decision step 412), then an appropriate network address
translation for the flow is set up (step 416), a connection is set
up with the appropriate server (using a pre-cached, persistent, or
newly created connection) (step 426), and the content request is
passed to the server (step 428).

If the CSD is unable to identify any local servers to serve the
content request (decision step 406), or if the FAC is unable to
assign a flow for the content request to a local server (decision
step 412), then if the status indicator (returned by either the CSD
in step 404 or the FAC in step 408) indicates that the flow should
be redirected to a remote server (step 410), then the flow is
redirected to a remote server (step 414). If the CSD indicated (in
step 404) that the
flow should be spoofed (decision step 418), then the client
TCP request is spoofed (step 420). If the flow cannot be 
assigned to any server, then the flow is rejected with an 
appropriate error (step 422). 
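
The FIG. 3 decision sequence described above can be summarized in a few lines of Python. This is a minimal sketch under stated assumptions: the `Status` enum, the `service_request` function, and the `fac_assign` callback (standing in for the FAC) are hypothetical names, and servers are represented as plain strings.

```python
from enum import Enum
from typing import Callable, List, Optional, Tuple


class Status(Enum):
    ACCEPT = "accept"
    REDIRECT = "redirect"
    SPOOF = "spoof"
    REJECT = "reject"


def service_request(
    csd_status: Status,
    candidates: List[str],
    fac_assign: Callable[[List[str]], Tuple[Optional[str], Status]],
) -> Tuple[str, Optional[str]]:
    """Return (action, server) for one client content request."""
    status = csd_status
    if csd_status is Status.ACCEPT:                  # decision step 406
        server, fac_status = fac_assign(candidates)  # step 408
        if server is not None:                       # decision step 412
            # Steps 416, 426, 428: set up NAT, connect, pass the request on.
            return ("forward", server)
        status = fac_status                          # FAC could not assign
    if status is Status.REDIRECT:                    # decision step 410
        return ("redirect", None)                    # step 414
    if csd_status is Status.SPOOF:                   # decision step 418
        return ("spoof", None)                       # step 420
    return ("reject", None)                          # step 422
```

For example, a request that the CSD accepts but that the FAC cannot place locally falls through to a redirect or a rejection, depending on the status the FAC hands back.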

Referring to FIG. 4, the CSD parses a flow setup request
as follows. First, the CSD parses the URL representing the
client content request in order to identify the nature of the
requested content (step 429). If the request is an HTTP
request, for example, elements of the HTTP header, including
the HTTP content-type, are extracted. In the case of a
non-HTTP request, the combination of protocol number and
source/destination port is used to identify the nature of the
requested content. In the case of an HTTP request, the
content-type or filename extension is used to deduce a QoS
class, delay, minimum bandwidth, and frame loss ratio as
shown in Table 1, below. The content-size is used to determine
the size of the requested flow. Overall flow intensity is
monitored by the content-aware flow switch 110 by calculating
the average throughput of all flows. The degree to
which a particular piece of content served by a server is "hot
content" is measured by monitoring the number of hits
(requests) the content receives. The burstiness of a flow is
determined by calculating the number of flows per content
per time unit.
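
The hit count and burstiness measures just described can be illustrated with a small tracker. This Python sketch rests on assumptions: the `ContentStats` class, its sliding-window approach, and the choice of window length are hypothetical and serve only to make "hits" and "flows per content per time unit" concrete.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class ContentStats:
    """Tracks per-content hits and recent request rate (hypothetical)."""

    def __init__(self, window_s: float = 1.0) -> None:
        self.window_s = window_s                        # length of the time unit
        self.hits: Dict[str, int] = defaultdict(int)    # total requests per content
        self._recent: Dict[str, Deque[float]] = defaultdict(deque)

    def record_request(self, url: str, now: float) -> None:
        """Count one request and drop timestamps older than the window."""
        self.hits[url] += 1
        recent = self._recent[url]
        recent.append(now)
        while recent and now - recent[0] > self.window_s:
            recent.popleft()

    def burstiness(self, url: str) -> float:
        """Flows per time unit for this content, as of its latest request."""
        return len(self._recent[url]) / self.window_s
```

Comparing `hits` against a threshold gives a "hot content" test, and a high `burstiness` value indicates a burst of requests for that content.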

Identifying the nature of the requested content also
involves deducing, from the content request and information
stored in the CSD, the QoS requirements of the requested
content. These QoS requirements include:

Bandwidth, defined by the number of bytes of content to
be transferred over the average flow duration.

Delay, defined as the maximum delay suitable for retrieving
particular content.

Frame Loss Ratio, defined as the maximum acceptable
percentage of frame loss tolerated by the particular type of
content.

A QoS class is assigned to a flow based on the flow's 
calculated QoS requirements. Eight QoS classes are sup- 
ported by the flow switch 110. Table 1 indicates how these 
classes might be used. 

                               TABLE 1

QoS    Delay           Min          Frame Loss             Example
Class  (End to End)    Bandwidth    Ratio                  Applications

0      N/A             N/A          10^[illegible]         Control Flows
1      <250 ms         8 KBPS       10^-8                  Internet Phone
2      Interactive     4 KBPS       10^[illegible]         Distance Learning, Telemetry,
                                                           streaming video/audio
3      500 ms          0-16 Mbps                           Media distribution, multi-user
                                                           games, interactive TV
4      Low             64 KBPS      Data: 10^[illegible]   Entertainment,
                                    Streaming: 10^-4       traditional fax
5      Low             N/A          10^[illegible]         Stock Ticker, News
6      N/A             N/A          10^[illegible]         Service Distribution,
                                                           Internet Printing
7      N/A             N/A          10^[illegible]         Best effort traffic (email,
                                                           Internet fax, database, etc.)

After the nature of the requested content has been
identified, the CSD queries its database for records of
candidate servers containing the requested content (step
430). If the CSD cannot find any records in the database to 
satisfy a given content request (decision step 432), the 
ICP/IPP is asked to locate the requested content, in order to 
increase the probability that future requests for the requested 
content will be satisfied (step 446). The CSD then returns a 
NULL list to the WFR with a status indicator indicating that 
the flow request should be rejected (steps 434, 444). 

If one or more matching server records are found
(decision step 432) and the client request is in the form of an
HTTP GET (decision step 436), then the CSD determines
whether any of the existing content records exactly matches
the requested content (decision step 448). For example,
consider a content request for
http://www.company.com/document.html. The CSD will consider a content record for
http://www.company.com/* to be an exact match for the 
content request. The CSD will consider a record for http:// 
www.company.com/ to be a match for the request, but not 
the most specific match. In the case of an exact match, the 
CSD sorts the list of candidate servers (identified in step 
430) based on configurable preferences (step 442). In the 
case of at least one match but no exact matches, the CSD 
creates a new record containing default information 
extracted from the most specific matching record, as well as 
additional information gleaned from the content request 
itself (step 450). This additional information may include the 
QoS requirements of the flow, based on the port number of 
the content request, or the filename extension (e.g., ".mpg" 
might indicate a video clip) contained in the request. The 
CSD asks the ICP/IPP to probe, in the background, for more 
specific information to use for future requests (step 452). 
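
The idea of deriving QoS requirements for a new record from the port number or filename extension can be sketched as a simple lookup. In this Python example the function name, the two lookup tables, and their entries are hypothetical; the class numbers follow Table 1 (class 2 for streaming video/audio, class 7 for best effort traffic), and a real switch would take such mappings from configuration.

```python
import posixpath
from urllib.parse import urlparse

STREAMING = 2      # Table 1: streaming video/audio
BEST_EFFORT = 7    # Table 1: best effort traffic

# Hypothetical mappings, shown only to illustrate the lookup.
EXTENSION_QOS = {".mpg": STREAMING}   # ".mpg" might indicate a video clip
PORT_QOS = {25: BEST_EFFORT}          # SMTP email


def guess_qos_class(url: str, port: int, default: int = BEST_EFFORT) -> int:
    """Guess a QoS class from the filename extension, then the port number."""
    ext = posixpath.splitext(urlparse(url).path)[1].lower()
    if ext in EXTENSION_QOS:
        return EXTENSION_QOS[ext]
    return PORT_QOS.get(port, default)
```

The extension is checked first because it says more about the content than the port does; anything unrecognized falls back to the default class until a background probe supplies better information.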

If one or more server records are found (decision step 
432) and the client content request is in the form of a TCP 
SYN (decision step 436), the mere receipt by the flow switch 
of a TCP SYN may not provide the CSD with enough 
information about the nature of the requested flow for the 
CSD to make a determination of which available servers can 
service the requested flow. For example, the TCP SYN may 
indicate the server to which the content request is addressed, 
but not indicate which specific piece of content is being 
requested from the server. If receipt of an HTTP GET from
the client is required to identify a server to serve the content 
request (decision step 438), then the CSD returns a NULL 
server list to the WFR with a status indicator requesting that 
the TCP connection be spoofed and that the subsequent 
HTTP GET from the client be forwarded to the CSD (step 
440). 

If the TCP SYN is adequate to identify a server to service 
the content request (decision step 438), then the CSD sorts 
the list of candidate servers (identified in step 430) based on 
configurable preferences (step 442). 

If adequate information was available in the content 
request to generate a list of available servers (decision step 
432) and the request may be serviced by one of the servers 
locally attached to the data switch (decision step 451), then 
the Client Capability Database (CCD) is queried for any 
available information on the capabilities of the requesting 
client (step 453). 
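
The branches of FIG. 4 described above can be condensed into one function that names the actions the CSD takes. This Python sketch is an assumption-laden summary: the function name, its boolean parameters, and the action strings are hypothetical, and the CCD query of step 453 is left out for brevity.

```python
from typing import List


def csd_parse_actions(
    records_found: bool,
    is_http_get: bool,
    exact_match: bool,
    syn_sufficient: bool,
) -> List[str]:
    """Return the CSD's actions for one flow setup request (FIG. 4)."""
    if not records_found:                         # decision step 432
        # Step 446, then steps 434 and 444.
        return ["ask_probe_to_locate_content", "reject"]
    if is_http_get:                               # decision step 436
        if exact_match:                           # decision step 448
            return ["sort_candidates"]            # step 442
        # Steps 450 and 452.
        return ["create_default_record", "probe_in_background"]
    if not syn_sufficient:                        # decision step 438
        return ["spoof_and_await_get"]            # step 440
    return ["sort_candidates"]                    # step 442
```

Writing the branches this way makes it easy to see that a TCP SYN alone leads either to spoofing or straight to sorting, while an HTTP GET always has enough detail to match against content records.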

Referring to FIG. 5, given a content request and a list of 
candidate servers, the CSD sorts the list of candidate servers 
as follows. If the CSD content records indicate that the 
requested content is "sticky" (i.e., that a client who accesses 
such content must remain attached to a single server for the 
duration of the transaction between the client and the server, 
which could be comprised of multiple individual content 
requests) (decision step 454), then the CSD searches an 
internal database to determine to which server this client was
previously "stuck" (step 456). If the CSD finds no record for this
client (decision step 458), then the CSD indicates that the request
should be rejected (step 464). If the CSD finds a record of this
client (decision step 458), then the CSD creates and returns a list
of candidate servers which includes only the "sticky" server to
which the client was previously "stuck" (step 460), and indicates
that a local server to serve the content request was found (step
462). If the requested content is not "sticky" (decision step 454),
then the list of candidate servers is ordered according to the
method of FIG. 6 (step 456).

Referring to FIG. 6, the CSD orders the list of candidate servers as
follows. The CSD evaluates the requested content according to
several criteria (step 468). The CSD filters the candidate server
list and orders (sorts) the candidate servers remaining in the
candidate server list (step 470). Servers in the candidate server
list are assigned proximity preferences (step 472).

If the first server in the sorted list of candidate servers is a
remote server (decision step 474), then the CSD assigns a value of
REDIRECT to a status indicator (step 476). If the first server in
the sorted list of candidate servers is a local server (decision
step 474), then the CSD assigns a value of ACCEPT to the status
indicator (step 478). The CSD returns the status indicator and the
ordered list of candidate servers (step 480).

Referring to FIG. 7, a particular requested content is evaluated by
the CSD as follows. A variable requestFlag is used to store several
flags (values which can be either true or false) relating to the
requested content. Flags stored in requestFlag include BURSTY
(indicating whether the requested content is undergoing a burst of
requests), LONG (indicating that the request is likely to result in
a long-lived flow), FREQUENT (indicating that the requested content
is frequently requested), and HI_PRIORITY (indicating that the
requested content is high priority content).

If the current time at which the requested content is being
requested minus the previous time at which the requested content was
requested is not greater than avgInterval (the average period of
time between flow requests for the requested content) (decision step
482), then a variable burstLength is assigned a value of zero (step
484) and requestFlag is assigned a value of zero (step 486).
Otherwise (decision step 482), the value of the variable burstLength
is incremented (step 488), and if the value of burstLength is
greater than MIN_BURST_RUN (decision step 490), then avgInterval is
recalculated (step 492), and the variable requestFlag is assigned a
value of BURSTY (step 494). MIN_BURST_RUN is a configurable value
which indicates how many sub-avgInterval requests for a given piece
of content constitute the beginning of a burst.

A variable runTime is set equal to the current time (step 496). A
flag requestFlag is used to store several pieces of information
describing the requested content. If the size of the requested
content is greater than a predetermined constant SMALL_CONTENT
(decision step 498), then the LONG flag in requestFlag is set (step
502). If the requested content is streamed (decision step 500), then
the LONG flag in requestFlag is set (step 502). If the number of
hits the requested content has received is greater than a
predetermined constant HOT_CONTENT (decision step 504), then the
FREQUENT flag in requestFlag is set (step 506). If the requested
content has previously been flagged as HIGH_PRIORITY (decision step
508), then the HI_PRIORITY flag in requestFlag is set (step 510).

Referring to FIG. 8, the CSD assigns status indicators to the
servers in the candidate server list as follows. The first server in
the candidate server list is selected (step 514). If the selected
server should be filtered (decision step 516), then the selected
server is removed from the candidate server list (step 518).
Otherwise, the server is evaluated (step 520), and ordering rules
are applied to the selected server to assign a status indicator to
the selected server (step 522). If there are more servers in the
candidate server list (decision step 524), then the next server in
the candidate server list is selected (step 526), and steps 516-524
are repeated. Otherwise, assignment of status indicators to the
servers in the candidate server list is complete (step 528).

Referring to FIG. 9, servers are filtered from the candidate server
list as follows. If a server has not responded to recent queries
(decision step 530), is no longer reachable due to a network
topology change (decision step 532), or no longer contains the
requested content (indicated by an HTTP 404 error in response to a
request for the requested content), then the server is flagged for
removal from the candidate server list (step 536).

Referring to FIG. 10, a server in the candidate server list is
evaluated as follows. A variable serverFlag is used to store several
flags relating to the server. Flags stored in serverFlag include
RECENT_THIS (indicating that a request was recently made to the
server for the same content as is being requested by the current
content request), RECENT_OTHER (indicating that a request was
recently made to the server for content other than the content being
requested by the current content request), RECENT_MANY (indicating
that many distinct requests for content have recently been made to
the server), LOW_BUFFERS (set to TRUE when one or more recent
requests have been streamed), RECENT_LONG (indicating that one or
more of the server's recent flows was long-lived), LOW_PORT_BW
(indicating that the server's port bandwidth is low), and LOW_CACHE
(indicating that the server is low on cache resources).

If the server was not recently accessed (decision step 540), then
none of the flags in serverFlag are set, and evaluation of the
server is complete (step 570). Otherwise, if the server was recently
accessed for the same content as is being requested by the current
content request (decision step 542), then serverFlag is assigned a
value of RECENT_THIS (step 546); otherwise, serverFlag is assigned a
value of RECENT_OTHER (step 548). If there have been many recent
distinct requests to the server (decision step 550), then the
RECENT_MANY flag in serverFlag is set (step 552). If any of the
recent requests to the server were streamed (decision step 554),
then the LOW_BUFFERS flag of serverFlag is set (step 556). If any of
the recent requests to the server were long-lived (decision step
558), then the RECENT_LONG flag of serverFlag is set (step 560). If
the port bandwidth of the server is low (decision step 562), then
the LOW_PORT_BW flag of serverFlag is set (step 564). If the
RECENT_OTHER flag of serverFlag is set (decision step 566), then the
LOW_CACHE flag of serverFlag is set (step 568).

Referring to FIG. 11, a server in the candidate server list is
ordered within the candidate server list as follows. A variable
Status is used to indicate whether the server should be placed at
the bottom of the candidate server list. Specifically, if the
HI_PRIORITY flag of requestFlag is set (decision step 572), then
Status is assigned a value according to FIG. 12 (step 574). If the
BURSTY flag of requestFlag is set (decision step 576), then Status
is assigned a value according to FIG. 13 (step 578). If the FREQUENT
flag of
requestFlag is set (decision step 580), then Status is assigned a
value according to FIG. 14 (step 582). If the LONG flag of
requestFlag is set (decision step 584), then Status is assigned a
value according to FIG. 15 (step 586); otherwise, Status is assigned
a value according to FIG. 16 (step 588). If the value of Status is
not OKAY (decision step 590), then the server is considered not
optimal and is placed at the bottom of the candidate server list
(step 584). Otherwise, the server is considered adequate and is not
moved within the candidate server list (step 592).

Referring to FIG. 12, in the case of a request for a flow for which
the HI_PRIORITY flag of requestFlag is set, if the LOW_CACHE flag of
serverFlag is set (decision step 596), the RECENT_OTHER flag of
serverFlag is set (decision step 598), the LOW_PORT_BW flag of
serverFlag is set (decision step 600), or the RECENT_LONG flag of
serverFlag is set (decision step 602), then Status is assigned a
value of NOT_OPTIMAL (step 608). Otherwise, Status is assigned a
value of OKAY (step 604).

Referring to FIG. 13, in the case of a request for a flow for which
the BURSTY requestFlag is set and the RECENT_THIS serverFlag is not
set (decision step 608), and if either the LOW_CACHE or RECENT_MANY
serverFlag is set (decision steps 610 and 612), then Status is
assigned a value of NOT_OPTIMAL (step 616). Otherwise, Status is
assigned a value of OKAY (step 614).

Referring to FIG. 14, a value is assigned to Status in the case of a
request for a flow which is not bursty and not frequently requested
as follows. Status is assigned a value of NOT_OPTIMAL (step 644) if
any of the following conditions obtain: (1) the LONG flag of
requestFlag is set and the LOW_BUFFERS and LOW_CACHE flags of
serverFlag are set (decision steps 620, 622, and 624); (2) the
RECENT_MANY, RECENT_THIS, and LOW_CACHE flags of serverFlag are set
(decision steps 626, 628, and 630); (3) the RECENT_LONG,
RECENT_THIS, and LOW_CACHE flags of serverFlag are set (decision
steps 632, 634, and 636); or (4) the LONG flag of requestFlag is set
and the LOW_PORT_BW flag of serverFlag is set (decision steps 638
and 640). Otherwise, Status is assigned a value of OKAY (step 642).

Referring to FIG. 15, a value is assigned to Status in the case of a
request for a flow which is non-bursty, frequently requested, and
short-lived as follows. Status is assigned a value of NOT_OPTIMAL
(step 664) if any of the following conditions obtain: (1) the
LOW_BUFFERS and LOW_CACHE flags of serverFlag are set (decision
steps 646, 648); (2) the RECENT_LONG, RECENT_OTHER, and LOW_CACHE
flags of serverFlag are set (decision steps 650, 652, and 654); or
(3) the RECENT_MANY, RECENT_OTHER, and LOW_CACHE flags of serverFlag
are set (decision steps 656, 658, and 660). Otherwise, Status is
assigned a value of OKAY (step 662).

Referring to FIG. 16, a value is assigned to Status in the

to the client making the content request (step 472). The details of
sorting by proximity are discussed in more detail below with respect
to the Internet Proximity Assist (IPA) and with respect to FIG. 22.

The first server in the candidate server list is examined, and if it
is local to the content-aware flow switch 110 (decision step 474),
then a variable Status is assigned a value of ACCEPT (step 478),
indicating that the content-aware flow switch 110 can service the
requested flow using a local server. Otherwise, Status is assigned a
value of REDIRECT (step 476), indicating that the flow request
should be redirected to a remote server.

The process of deciding whether to create a flow in response to a
client content request is referred to as Flow Admission Control
(FAC). Referring again to FIG. 3, if the value of Status is ACCEPT
(decision step 406), then the FAC is asked to assign the requested
flow to a local server (step 408). The FAC admits flows into the
flow switch 110 based on flow QoS requirements and the amount of
link bandwidth, flow switch bandwidth, and flow switch buffers. Flow
admission control is performed for each content request in order to
verify that adequate resources exist to service the content request,
and to offer the content request the level of service indicated by
its QoS requirements. If sufficient resources are not available, the
content request may be redirected to another site capable of
servicing the request or simply be rejected.

More specifically, referring to FIG. 17, the FAC assigns a flow to a
local server from among an ordered list of candidate servers, in
response to a content request, as follows. First, the FAC fetches
the first server record from the list of candidate servers (step
684). If the server record is for a local server (decision step
686), and the local server can satisfy the content request (decision
step 690), then the FAC indicates that the content request has been
successfully assigned to a local server (step 694). If the server
record is not for a local server (decision step 686), then the FAC
indicates that the content request should be redirected (step 688).

If the server record is for a local server (decision step 686) that
cannot satisfy the content request (decision step 690), and there
are more records in the list of candidate servers to evaluate
(decision step 696), then the FAC evaluates the next record in the
list of candidate servers (step 698) as described above. If all of
the records have been evaluated without redirecting the request or
assigning the request to a local server, then the content request is
rejected, and no flow is set up for the content request (step 700).

Referring to FIG. 18, the FAC attempts to establish a flow between a
client and a candidate server, in response to a client content
request, as follows. The FAC extracts, from the CSD server record
for the candidate server, the egress port of the flow switch to
which the candidate server is connected. The FAC also extracts, from
the content request,

case of request for flows which are not handled by any of the ingress port of the flow switch at which the content 

FIGS. 12-15 as follows. Status is assigned a value of request arrived (step 726). Using the information obtained in 

NOT__OPTIMAL (step 680) if any of the following condi- step 726 and other information from the candidate server 

tions obtain: (1) the LOW_BUFFERS and LOW_CACHE record, the FAC constructs one or more QoS tags (step 728). 

flags of serverFlag are set (decision steps 666, 668); (2) the 50 A QoS tag encapsulates information about the deduced QoS 

RECENT_MANY and LOW_CACHE flags of serverFlag requirements of an existing or requested flow, 

are set (decision steps 67 and 672); or (3) the RECENT„ [f the requested content is not served by a (physical or 

LONG and LOW„PORT_BW flags of serverFlag are set virtual) web host associated with a flow pipe (decision step 

(decision steps 674 and 676). Otherwise, Status is assigned 730), then the FAC attempts to add the requested flow to an 

a value of OKAY (step 678). 65 existing VC pipe (step 732). A VC pipe is a logical aggre- 

Referring again to FIG. 6, the servers remaining in the gation of flows sharing similar characteristics; more 

candidate server list are sorted again, this time by proximity specifically, all of the flows aggregated within a single VC 
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pipe share the same ingress port, egress port, and QoS minimum bandwidth requirement and higher buffer require- 
requirements. Otherwise, the FAC attempts to add the ments than the given QoS tag (step 768). If the requested 
requested flow to the flow pipe associated with the server content is not to be streamed (decision step 770), then for 
identified by the candidate server record (step 734). Once the each existing QoS tag, the FAC calculates the average 
QoS requirements of a flow have been calculated, they are 5 bandwidth, calculates the TCP window size as TcpW= 
stored in a QoS tag, so that they may be subsequently AvgBW*RTT, and verifies that the TCP window size is at 
accessed without needing to be recalculated. least 4K (the minimum requirement for HTTP transfers) 
Referring to FIG. 19, the FAC constructs a QoS tag from (step 774). If the requested content is to be streamed 
a candidate server record, ingress and egress port (decision step 770), then the FAC examines each existing 
information, and any available client information, as fol- i° QoS tag and excludes those that are not capable of delivering 
lows. If the requested content is not to be delivered using required peak bandwidth PeakBW or burst tolerance 
TCP (decision step 738), then the FAC calculates the mini- tool, as calculated in FIG. 19, steps 746 and 748 (step 772). 
mum bandwidth requirement MinBW of the requested con- The resulting list of QoS tags is then used when aggregating 
tent based on the total bandwidth PortBW available to he toe flow a VC-pipe or flow pipe- 
logical egress port of the flow and the hop latency hopLa- 15 0° e °* me effects of the procedures shown in FIGS. 3-20 
tency (a static value contained in the candidate server ^ mat tnc A° w switch 110 functions as a network address 
record) of the flow, using the formula: translation device. In this role, it receives TCP session setup 

requests from clients, terminates those requests on behalf of 

MinBW-framesize/hopLatency) Formula l the servers, and initiates (or reuses) TCP connections to the 

20 best-fit target server on the client's behalf. For that reason, 

(step 756). If the requested content is to be delivered using two separate TCP sessions exist, one between the client and 

TCP (decision step 738), then the FAC calculates the aver- the flow switch, the other between the flow switch and the 

age bandwidth requirement AvgBW of the requested flow best-fit server. As such, the IP, TCP, and possible content 

based on the size of the candidate server's cache CacheSize headers on packets moving bidirectionally between the 

(contained in the candidate server record), the TCP window 2 s client and server are modified as necessary as they traverse 

size TcpW (contained in the content request), and the round the content-aware flow switch 110. 

trip time RTT (determined during the initial flow pj ow pipe S 

handshake), using the formula: \ content-aware flow switch can be used to front-end 

many web servers. For example, referring to FIG. 1c. the 

AvgBW-mmCCacheSizc, TcpW)/RTT Formula 2 3Q flow switch UQ front ^ nds web ^ 10Otf-C. Each of the 

(step 740). Hie FAC uses the average bandwidth AvgBW ™* servers 100^ may emtody one or more 

and the flow switch latency (a constant) to determine the ^ eb host f (™ s). Associated with each of the 

minimum bandwidth requirement MinBW of the requested ™ s fipm^nded by the flow switch 110 may be a flow 

content using the formula: P** whlch 15 a lo S lcal W^lion of the VWH s flows. 

35 Flow pipes guarantee an individual VWH a configurable 

MmBW«min(AvgBW* MinToAvg, ciientBW) Formula 3 amount of bandwidth through the content- aware flow switch 

110. 

In Formula 3, MinToAvg is the flow switch latency and Referring to FIG. 21a, web servers lOOa-c provide ser- 

clientBW is derived from the maximum segment size (MSS) vice to VWHs lOOd-f as follows. Web server 100a provides 

option of the flow request (step 742). 40 all services to VWH lOOi. Web server 100b provides service 

The content-aware flow switch 110 reserves a fixed to VWH lOOe and a portion of the services to VWH lOOf. 

amount of buffer space for flows. The FAC is responsible for Web server 100c provides service to the remainder of VWH 

calculating the buffer requirements (stored in the variable 100/ Associated with VWHs lOOd-f are flow pipes 784a, 

Buffers) of both TCP and non-TCP flows, as follows. If the 784b, and 784c, respectively. Note that flow pipes lS4a~c 

requested flow is not to be streamed (decision step 744), then 45 are logical entities and are therefore not shown in FIG. 21a 

the flow is provided with a best-effort level of buffers (step as connecting to VWH's XOOd-f or the flow switch 110 at 

758). Streaming is typically used to deliver real-time audio physical ports. 

or video, where a minimum amount of information must be The properties of each of the VWH's 100d-/is configured 

delivered per unit of time. If the content is to be streamed by the system administrator. For example, each of the 

(decision step 744), then the burst tolerance btol of the flow 50 VWH's lOOd-f has a bandwidth reservation. The flow 

is calculated (step 746), the peak bandwidth of the flow is switch 110 uses the bandwidth reservation of a VWH to 

calculated (step 748), and the buffer requirements of the flow determine the bandwidth to be reserved for the flow pipe 

are calculated (step 750). A QoS tag is constructed contain- associated with the VWH. The total bandwidth reserved by 

ing information derived from the calculated minimum band- the flow switch 110 for use by flow pipes, referred to as the 

width requirement and buffer requirements (step 752). The 55 flow pipe bandwidth, is the sum of all the individual flow 

FAC searches for any other similar existing QoS tags that pipe reservations. The flow switch 110 allocates the flow 

sufficiently describe the QoS requirements of the requested pipe bandwidth and shares it among the individual flow 

content (step 754). pipes 7H4a-c using a weighted round robin scheduling 

Referring to FIG. 20, the FAC locates any existing QoS algorithm in which the weight assigned to an individual flow 

tags which are similar enough (in MinBW and Buffers) to 60 pipe is a percentage of the overall bandwidth available to 

the QoS tag constructed in FIG. 19 to be acceptable for this clients. The flow switch 110 guarantees that the average total 

content request, as follows. If the requested content is not to bandwidth actually available to the flow pipe at any given 

be delivered via TCP (decision step 764), then the FAC finds time is not less than the bandwidth configured for the flow 

all QoS tags with a higher minimum bandwidth requirement pipe regardless of the other activity in the flow switch 110 

but with lower buffer requirements than the given QoS tag 65 at the time. Individual flows within a flow pipe are sepa- 

(step 766). If the content is to be delivered via TCP (decision rately weighted based on their QoS requirements. The flow 

step 764), then the FAC finds all QoS tags with a lower switch 110 maintains this bandwidth guarantee by propor- 
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tionaDy adjusting the weights of the individual flows in the 
flow pipe so that the sum of the weights remains constant. 
By policing against over-allocation of bandwidth to a par- 
ticular VWH, fairness can be achieved among the VWH's 
competing for outbound bandwidth through the flow switch 
110. 

Again referring to FIG. 21a, consider the case in which
the flow switch 110 is configured to provide service to three
VWH's 100d-f. Suppose that the bandwidth requirements of
VWH's 100d-f are 64 Kbps, 256 Kbps, and 1.5 Mbps,
respectively. The total flow pipe bandwidth reserved by the
flow switch 110 is therefore 1.82 Mbps. Assume for purposes
of this example that the flow switch 110 is connected
to the Internet by uplinks 115a-c with bandwidths of 45
Mbps, 1.5 Mbps, and 1.5 Mbps, respectively, providing a
total of 48 Mbps of bandwidth to clients. In this example,
flow pipe 784a is assigned a weight of 0.0013 (64 Kbps/48
Mbps), flow pipe 784b is assigned a weight of 0.0053 (256
Kbps/48 Mbps), and flow pipe 784c is assigned a weight of
0.0312 (1.5 Mbps/48 Mbps). As individual flows within flow
pipes 784a-c are created and destroyed, the weights of the
individual flows are adjusted such that the total weight of the
flow pipe is held constant.
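
As a hedged illustration of the arithmetic in this example, the following Python sketch recomputes the flow pipe weights and shows one way the weights of individual flows could be rescaled so that their sum stays equal to the weight of their flow pipe. The function names and the relative flow weights are assumptions made for illustration and do not come from the specification; 1 Mbps is taken as 1000 Kbps.

```python
def pipe_weights(reservations_kbps, uplinks_kbps):
    """Weight of each flow pipe = its reservation / total uplink bandwidth."""
    total_uplink = sum(uplinks_kbps)
    return {pipe: kbps / total_uplink
            for pipe, kbps in reservations_kbps.items()}


def rebalance_flows(relative_weights, pipe_weight):
    """Scale per-flow weights proportionally so they sum to pipe_weight."""
    total = sum(relative_weights.values())
    if total == 0:
        return {flow: 0.0 for flow in relative_weights}
    return {flow: w * pipe_weight / total
            for flow, w in relative_weights.items()}


# Reservations of flow pipes 784a-c and uplinks 115a-c, in Kbps.
reservations = {"784a": 64, "784b": 256, "784c": 1500}
uplinks = [45000, 1500, 1500]
weights = pipe_weights(reservations, uplinks)

# Hypothetical relative QoS weights for two flows sharing pipe 784b.
flows = {"786a": 1.0, "786b": 3.0}
scaled = rebalance_flows(flows, weights["784b"])
```

Calling `rebalance_flows` again whenever a flow is created or destroyed keeps the sum of the flow weights equal to the pipe weight, which is the invariant the text describes.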

The relationship between flows, flow pipes, and the
physical ingress ports 170a-c and physical egress ports
165a-c of the content-aware flow switch 110 is discussed
below in connection with FIG. 21b. Flows 782a-c from
VWH 100d enter the flow switch at egress port 165a. Flows
786a-b from VWH 100e enter the flow switch at egress port
165b. Flow 786c from VWH 100f enters the flow switch at
egress port 165b. Flows 788a-c from VWH 100f enter the
flow switch from egress port 165c. After entering the flow
switch 110, the flows 782a-c, 786a-c, and 788a-c are
managed within their respective flow pipes 784a-c as they
pass through the switching matrix 790. The switching matrix
is a logical entity that associates a logical ingress port and a
logical egress port with each of the flows 782a-c, 786a-c,
and 788a-c. As previously mentioned, each of the physical
ingress ports 170a-c may act as one or more logical ingress
ports, and each of the physical egress ports 165a-c may act
as one or more logical egress ports. FIG. 21b shows a
possible set of associations of physical ingress ports with
flow pipes and physical egress ports for the flows 782a-c,
786a-c, and 788a-c.
Internet Proximity Assist 

A client may request content that is available from several
candidate servers. In such a case, the Internet Proximity
Assist (IPA) module of the content-aware flow switch 110
assigns a preference to servers which are determined to be
"closest" to the client, as follows.

The Internet is composed of a number of independent 
Autonomous Systems (AS's). An Autonomous System is a 
collection of networks under a single administrative 
authority, typically an Internet Service Provider (ISP). The 
ISPs are organized into a loose hierarchy. A small number of 
"backbone" ISPs exist at the top of the hierarchy. Multiple 
AS's may be assigned to each backbone service provider. 
Backbone service providers exchange network traffic at
Network Access Points (NAPs). Therefore, network congestion
is more likely to occur when a data stream must pass
through one or more NAPs from the client to the server. The
IPA module of the content-aware flow switch 110 attempts
to decrease the number of NAPs between a client and a
server by making an appropriate choice of server.

The IPA uses a continental proximity lookup table which
associates IP addresses with continents as follows. Most IP
address ranges are allocated to continental registries. The
registries, in turn, allocate each of the address ranges to
entities within a particular continent. The continental
proximity lookup table may be implemented using a Patricia tree
which is built based on the IP address ranges that have been
allocated to various continental registries. The tree can then
be searched using the well-known Patricia search algorithm.
An IP address is used as a search key. The search results in
a continent code, which is an integer value that represents
the continent to which the address is registered. Given the
current allocations of IP addresses, the possible return values
are shown in Table 2.



TABLE 2

ID    Continent
0     Unknown
1     Europe
2     North America
3     Central and South America
4     Pacific Rim



Additional return values can be added as IP addresses are 
allocated to new continental registries. Given the current 
allocation of addresses, the continental proximity table used 
by the IPA is shown in Table 3. 



TABLE 3

IP ADDRESS RANGE                        CONTINENT IDENTIFIER
0.0.0.0 through 192.255.255.255         0 (Unknown)
193.0.0.0 through 195.255.255.255       1 (Europe)
196.0.0.0 through 197.255.255.255       0 (Unknown)
198.0.0.0 through 199.255.255.255       2 (North America)
200.0.0.0 through 201.255.255.255       3 (Central and South America)
202.0.0.0 through 203.255.255.255       4 (Pacific Rim)
204.0.0.0 through 209.255.255.255       2 (North America)
210.0.0.0 through 211.255.255.255       4 (Pacific Rim)
212.0.0.0 through 223.255.255.255       0 (Unknown)
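
The lookup described by Tables 2 and 3 can be sketched in Python as follows. This is a minimal illustration, not the Patricia tree the text suggests: it substitutes a binary search over the sorted range start addresses, which returns the same continent codes for this table. The names `continent_id`, `CONTINENTS`, and the handling of addresses above 223.255.255.255 (treated here as Unknown) are assumptions.

```python
import bisect
import ipaddress

# Continent codes from Table 2.
CONTINENTS = {
    0: "Unknown",
    1: "Europe",
    2: "North America",
    3: "Central and South America",
    4: "Pacific Rim",
}

# (first address of each range, continent code), transcribed from Table 3.
_RANGES = [
    ("0.0.0.0", 0),
    ("193.0.0.0", 1),
    ("196.0.0.0", 0),
    ("198.0.0.0", 2),
    ("200.0.0.0", 3),
    ("202.0.0.0", 4),
    ("204.0.0.0", 2),
    ("210.0.0.0", 4),
    ("212.0.0.0", 0),
    ("224.0.0.0", 0),  # assumption: anything past the table is Unknown
]
_STARTS = [int(ipaddress.IPv4Address(addr)) for addr, _ in _RANGES]
_CODES = [code for _, code in _RANGES]


def continent_id(address):
    """Return the continent code for a dotted-quad IPv4 address."""
    key = int(ipaddress.IPv4Address(address))
    # Index of the last range whose start address is <= key.
    return _CODES[bisect.bisect_right(_STARTS, key) - 1]
```

For example, `CONTINENTS[continent_id("194.1.2.3")]` looks up an address inside the 193.0.0.0 through 195.255.255.255 range and yields the Europe entry.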





Referring to FIG. 22, the IPA assigns proximity preferences
to zero or more servers, from a list of candidate servers
and a client content request, as follows. The IPA identifies
the continental location of the client (step 800). If the client
continent is not known (decision step 801), then control
passes to step 812, described below. Otherwise, the IPA
identifies the continental location of each of the candidate
servers (step 802) using the continental proximity lookup
table, described above. If all of the server continents are
unknown (decision step 803), control passes to step 807,
described below. Otherwise, if none of the candidate servers
are in the same continent as the client (decision step 804),
then the IPA does not assign a proximity preference to any
of the candidate servers (step 806).

At step 807, the IPA prunes the list of candidate servers to
those which are either unknown or in the same continent as
the client. If there is exactly one server in the same continent
as the client (decision step 808), then the server in the same
continent as the client is assigned a proximity preference
(step 810). For purposes of decision steps 804 and
808, a client and a server are considered to reside in the same
continent if their lookup results match and the matching
value is not 0 (unknown).

If there is more than one server in the same continent as
the client (decision step 808), then the IPA assigns a
proximity preference to one or more servers, if any, which share
a "closest" backbone ISP with the client, where "closest"
means that the backbone ISP can reach the client without
going through another backbone ISP. A closest-backbone
lookup table, which may be implemented using a Patricia
tree, stores information about which backbone AS's are
closest to each range of IP addresses. An IP address is used
as the key for a search in the closest-backbone lookup table.
The result of a search is a possibly empty list of AS's which
are closest to the IP address used as a search key.
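
The continental stage of FIG. 22 (steps 800 through 810) can be sketched in Python as below. This is an illustrative reading of the steps above, not the patented implementation. The function name, the outcome labels, and the input format (a mapping from server name to continent code) are assumptions. The text does not say what happens at decision step 808 when no server shares the client's continent, which can occur when all server continents are unknown; the sketch assumes control falls through to the backbone comparison in that case.

```python
UNKNOWN = 0  # continent code 0 in Table 2


def continental_stage(client_continent, server_continents):
    """Return (outcome, servers) for the continental part of FIG. 22.

    outcome is one of:
      "prefer"            - servers holds the single preferred server
      "no_preference"     - no server receives a proximity preference
      "compare_backbones" - continue with the closest-backbone comparison
    """
    servers = list(server_continents)

    # Decision step 801: unknown client continent, go to step 812.
    if client_continent == UNKNOWN:
        return "compare_backbones", servers

    # Same continent means matching codes with a value other than 0.
    same = [s for s in servers if server_continents[s] == client_continent]
    all_unknown = all(c == UNKNOWN for c in server_continents.values())

    # Decision steps 803 and 804, leading to step 806.
    if not all_unknown and not same:
        return "no_preference", []

    # Step 807: keep servers that are unknown or share the continent.
    pruned = [s for s in servers
              if server_continents[s] in (UNKNOWN, client_continent)]

    # Decision step 808 and step 810.
    if len(same) == 1:
        return "prefer", same

    # More than one match (per the text) or none (assumption).
    return "compare_backbones", pruned
```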

The IPA performs a query on the closest-backbone lookup 
table using the client's IP address to obtain a possibly empty 
list of the AS's that are closest to the client (step 812). The 
IPA queries the closest-backbone lookup table to obtain the 
AS's which are closest to each of the candidate servers 
previously identified as being in the same continent as the 
client (step 814). The IPA then identifies all candidate 
servers whose query results contain an AS that belongs to the 
same ISP as any AS resulting from the client query per- 
formed in step 812 (step 816). Each of the servers identified 
in step 816 is then assigned a proximity preference (step 
818). 

After any proximity preferences have been assigned in 
either step 810 or 818, the existence of a network path 
between the client and each of the preferred servers is 
verified (step 820). To verify the existence of a network path 
between the client and a server, the content-aware flow 
switch 110 queries the content-aware flow switch that front- 
ends the server. The remote content-aware flow switch either 
does a Border Gateway Protocol (BGP) route table lookup 
or performs a connectivity test, such as by sending a PING 
packet to the client, to determine whether a network path 
exists between the client and the server. The remote content- 
aware flow switch then sends a message to the content- 
aware flow switch 110 indicating whether such a path exists. 
Any server for which the existence of a network path cannot 
be verified is not assigned a proximity preference. Servers to 
which a proximity preference has been assigned are moved 
to the top of the candidate server list (step 822). 
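
Steps 820 and 822 can be sketched in Python as follows. The function name and the `path_exists` callable are hypothetical: the callable stands in for the query to the remote content-aware flow switch (BGP route table lookup or PING test), which is not modeled here.

```python
def reorder_candidates(candidates, preferred, path_exists):
    """Move preferred servers with a verified path to the top of the list.

    candidates  - ordered list of candidate servers
    preferred   - set of servers given a proximity preference
    path_exists - callable(server) -> bool, a stand-in for step 820
    """
    # Step 820: a preference survives only if a network path is verified.
    verified = [s for s in candidates if s in preferred and path_exists(s)]
    # Step 822: verified servers first, all others in their original order.
    rest = [s for s in candidates if s not in verified]
    return verified + rest
```

Both list comprehensions preserve the incoming order, so servers without a preference keep the ranking produced by the earlier sorting steps.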

Because multiple AS's may be assigned to a single ISP, an 
ISP-AS lookup table is used to perform step 816. The 
ISP-AS lookup table is an array in which each element 
associates an AS with an ISP. An AS is used as a key to query 
the table, and the result of a query is the ISP to which the key 
AS is assigned. 
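
Steps 812 through 818, together with the ISP-AS lookup table, can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the closest-backbone lookup is passed in as a callable (`closest_backbone`) rather than built as a Patricia tree over address ranges, the ISP-AS table is a plain dictionary, and the function name is hypothetical. The AS numbers, ISP names, and addresses in the usage example are made-up values taken from ranges reserved for documentation.

```python
def servers_sharing_backbone_isp(client_ip, server_ips,
                                 closest_backbone, isp_of_as):
    """Return the servers that share a closest backbone ISP with the client.

    closest_backbone - callable(ip) -> list of AS numbers (may be empty)
    isp_of_as        - dict mapping an AS number to the ISP it belongs to
    """
    # Step 812: AS's closest to the client, mapped to their ISPs.
    client_isps = {isp_of_as[a] for a in closest_backbone(client_ip)
                   if a in isp_of_as}

    preferred = []
    for ip in server_ips:
        # Step 814: AS's closest to this candidate server.
        server_isps = {isp_of_as[a] for a in closest_backbone(ip)
                       if a in isp_of_as}
        # Step 816: keep the server if any ISP is shared with the client.
        if client_isps & server_isps:
            preferred.append(ip)

    # Step 818: every server in this list receives a proximity preference.
    return preferred


# Hypothetical data for illustration only.
isp_table = {64496: "ISP-A", 64497: "ISP-A", 64498: "ISP-B"}
backbone_table = {
    "192.0.2.10": [64496],
    "198.51.100.1": [64497],
    "198.51.100.2": [64498],
    "203.0.113.5": [],
}
matches = servers_sharing_backbone_isp(
    "192.0.2.10",
    ["198.51.100.1", "198.51.100.2", "203.0.113.5"],
    lambda ip: backbone_table.get(ip, []),
    isp_table,
)
```

Comparing ISPs rather than raw AS numbers is what lets a server behind AS 64497 match a client behind AS 64496 in this example, since both AS's are assigned to the same ISP.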

Referring to FIG. 23, the invention may be implemented 
in digital electronic circuitry or in computer hardware, 
firmware, software, or in combinations of them. Apparatus 
of the invention may be implemented in a computer program 
product tangibly embodied in a machine-readable storage 
device for execution by a computer processor 1080; and 
method steps of the invention may be performed by a 
computer processor 1080 executing a program to perform 
functions of the invention by operating on input data and 
generating output. The processor 1080 receives instructions 
and data from a read-only memory (ROM) 1120 and/or a 
random access memory (RAM) 1110 through a CPU bus 
1100. The processor 1080 can also receive programs and 
data from a storage medium such as an internal disk 1030 
operating through a mass storage interface 1040 or a remov- 
able disk 1010 operating through an I/O interface 1020. The 
flow of data over an I/O bus 1050 to and from I/O devices 
and the processor 1080 and memory 1110, 1120 is controlled 
by an I/O controller 1090. 

The present invention has been described in terms of an
embodiment. The invention, however, is not limited to the
embodiment depicted and described. Rather, the scope of the
invention is defined by the claims.
What is claimed is: 

1. In a network, a method for directing packets between
a client and a server, the method comprising:

receiving a client request for content via the network;

deriving, from the client request, content information
descriptive of a plurality of characteristics of the
content requested by the client request;

in response to receiving the client request, selecting a
server from among a set of candidate servers based on

i) the derived content information; and

ii) a combination of server metrics obtained after
receipt of the client request from all available servers
capable of servicing the client request for content;

subsequently forwarding to the selected server
transmissions originating from the client which are associated
with the client request for content; and

subsequently forwarding to the client transmissions
originating from the selected server which are associated
with the client request for content.

2. The method of claim 1, wherein the client request is an 
HTTP request. 

3. The method of claim 2, wherein deriving content
information comprises:

extracting information from at least one portion of an
HTTP header of the client request.

4. The method of claim 2, wherein deriving content
information comprises deriving content information based
on a Universal Resource Locator (URL) included in the
client request.

5. The method of claim 2, wherein deriving content
information comprises deriving content information based
on a filename included in the client request.

6. The method of claim 5, wherein deriving content
information based on a filename comprises deriving content
information based on the filename extension.

7. The method of claim 2, wherein deriving content
information comprises deriving content information based
on a port identified in the client request.

8. The method of claim 2, wherein deriving content
information comprises deriving content information based
on query parameters included in the client request.

9. The method of claim 8, wherein the query parameters
comprise Common Gateway Interface (CGI) parameters
included in a URL of the client request.

10. The method of claim 2, wherein the client request
comprises one of the following: an HTTP GET message, an
HTTP HEAD message, an HTTP PUT message, and an
HTTP POST message.

11. The method of claim 2, wherein deriving content
information comprises extracting information from the body
of the client request.

12. The method of claim 1, wherein the client request is
a TCP request.

13. The method of claim 1, further comprising:

obtaining additional information from the client about the
content requested by the client request; and

wherein the selecting further comprises selecting based on
the additional information.

14. The method of claim 1, further comprising:

obtaining client capability information about the client;
and

wherein the selecting further comprises selecting the
selected server based on the client capability
information.

15. The method of claim 1, wherein selecting as the server
comprises:

determining whether the client request requires persistent
connectivity with a particular candidate server;

if the client request requires persistent connectivity with
a particular server, identifying a candidate server with
which the client is persistently connected for service of
the client request;

selecting the identified candidate server.

16. The method of claim 1, further comprising
determining whether an active path exists between the client and the
selected server.

17. The method of claim 16, wherein determining whether
an active path exists comprises sending a PING packet to the
client.

18. The method of claim 16, wherein determining whether
an active path exists comprises performing a Border Gateway
Protocol route table lookup.

19. The method of claim 16, wherein the location of the
client comprises a continent in which the client resides.

20. The method of claim 19, wherein the locations of the
plurality of servers are continents in which the servers
reside.

21. The method of claim 16, wherein identifying servers
that are in the same location as the client comprises:

identifying administrative authorities associated with the
client;

identifying, for each of the plurality of servers,
administrative authorities associated with the server; and

identifying servers associated with an administrative
authority that is associated with the client.

22. The method of claim 21, wherein the administrative
authorities are Internet Service Providers.

23. The method of claim 1, further comprising:

deriving, from the client request, quality of service
information descriptive of quality of service requirements of
the content requested by the client request; and

wherein the selecting further comprises selecting based on
the quality of service information.

24. The method of claim 1, wherein the deriving quality
of service information includes deriving quality of service
information from the content information.

25. The method of claim 1, wherein the deriving quality
of service information includes deriving quality of service
information from a size of the content requested by the client
request.

26. The method of claim 1, wherein quality of service
requirements comprise a bandwidth.

27. The method of claim 1, wherein quality of service
requirements comprise a delay.

28. The method of claim 1, wherein quality of service
requirements comprise a frame loss ratio.

29. The method of claim 1, wherein deriving quality of
service information comprises deriving quality of service
information from the MIME content type of the client
request.

30. The method of claim 1, wherein deriving information
descriptive of the content comprises deriving information
descriptive of the content type.

31. The method of claim 1, wherein the selecting further
comprises selecting based on at least one server metric
describing at least one expected level of service provided by
at least one of the candidate servers when serving the
requested content.

32. The method of claim 31, wherein the at least one 
server metric includes: 



one or more metrics selected from the following group: a 
metric descriptive of server availability, a metric 
descriptive of the current load of at least one of the 
candidate servers, a metric descriptive of recent activity 
on at least one of the candidate servers, a metric 
descriptive of network congestion between the client 
and at least one of the candidate servers, a metric 
descriptive of the number of active connections being 
maintained by at least one of the candidate servers, a 
metric descriptive of the response time of at least one 
of the candidate servers, information descriptive of one 
or more previous selections of candidate servers, and 
client-server proximity information descriptive of dis- 
tances between the client and at least one of the 
candidate servers. 

33. The method of claim 32, wherein client-server prox- 
imity information comprises information descriptive of a 
continent in which the client resides and a continent in which 
the server resides. 

34. The method of claim 33, wherein client-server prox- 
imity information further comprises information descriptive 
of an administrative authority associated with the client and 
an administrative authority associated with the server. 

35. The method of claim 34, wherein the administrative 
authorities are Internet Service Providers. 

36. The method of claim 31, wherein the at least one 
server metric includes: 

two or more metrics selected from the following group: a 
metric descriptive of server availability, a metric 
descriptive of the current load of at least one of the 
candidate servers, a metric descriptive of recent activity 
on at least one of the candidate servers, a metric 
descriptive of network congestion between the client 
and at least one of the candidate servers, a metric 
descriptive of the number of active connections being 
maintained by at least one of the candidate servers, a 
metric descriptive of the response time of at least one 
of the candidate servers, information descriptive of one 
or more previous selections of candidate servers, and 
client-server proximity information descriptive of dis- 
tances between the client and at least one of the 
candidate servers. 

37. The method of claim 31, wherein the at least one 
server metric is obtained by querying a database. 

38. The method of claim 31, wherein the at least one 
server metric is obtained by periodically querying servers in 
the Internet Protocol network. 

39. The method of claim 31, wherein the expected level 
of service provided by a candidate server is descriptive of 
whether the candidate server is receiving a burst of requests 
for the content requested by the client request. 

40. The method of claim 31, wherein the expected level 
of service provided by a candidate server is descriptive of 
whether satisfying the client request will result in a short- 
term flow. 

41. The method of claim 31, wherein the expected level 
of service provided by a candidate server is descriptive of 
whether the content requested by the client request has been 
frequently requested in the past. 

42. The method of claim 31, wherein the expected level 
of service provided by a candidate server is descriptive of 
whether the content requested by the client request has a 
high priority. 

43. The method of claim 31, wherein the expected level 
of service provided by a candidate server is descriptive of a 
probability that the content requested by the client request is 
cached by the server. 
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44. The method of claim 31, wherein the expected level 
of service provided by a candidate server is descriptive of 
whether the candidate server has responded to recent que- 
ries. 

45. The method of claim 31, wherein the expected level 5 
of service provided by a candidate server is descriptive of 
whether the candidate server recently responded to a request 
for the content requested by the client request with an 
indication that the content is not served by the candidate 

10 

server. 

46. The method of claim 31, wherein the expected level 
of service provided by a candidate server is descriptive of 
whether the candidate server is reachable. 

47. The method of claim 31, wherein the expected level
of service provided by a candidate server is descriptive of 
whether the candidate server's cache resources are below a 
threshold level. 

48. The method of claim 31, wherein the expected level
of service provided by a candidate server is descriptive of
whether the candidate server's active network connections 
are below a threshold level. 

49. The method of claim 31, wherein the expected level 
of service provided by a candidate server is descriptive of 
whether the candidate server's network bandwidth is below
a threshold level. 

50. A system for directing a stream of packets between a 
client and a server, the system comprising: 

a plurality of servers;

a switch coupled to the plurality of servers by an Internet
Protocol network through one or more communication
links, wherein the switch comprises:

means for receiving a client request for content via the 
Internet Protocol network;

means for deriving, from the client request, content 
information descriptive of a plurality of characteris- 
tics of the content requested by the content request; 

means, responsive to the means for deriving, for selecting
a server from among a set of candidate servers
serving the content requested by the client request, 
based on the content information; 

means for subsequently forwarding to the selected 
server transmissions originating from the client 
which are associated with the client request for
content; and 

means for subsequently forwarding to the client trans- 
missions originating from the selected server which 
are associated with the client request for content. 

51. The system of claim 50, wherein: 

the candidate servers comprise HTTP servers. 

52. A switch in an Internet Protocol network, comprising:

means for receiving a client request for content via the
Internet Protocol network;

means for deriving, from the client request, content
information descriptive of a plurality of characteristics of
the content requested by the content request; 

means, responsive to the means for deriving, for selecting 
a server from among a set of candidate servers serving 
the content requested by the client request, based on the 
content information; 

means for subsequently forwarding to the selected server 
transmissions originating from the client which are 
associated with the client request for content; and 

means for subsequently forwarding to the client transmis- 
sions originating from the selected server which are 
associated with the client request for content. 

53. In an Internet Protocol network, a method for use in 
a network switch, the method directing packets between a 
client and a server, the method comprising: 

receiving an HTTP (HyperText Transfer Protocol) client 
request for content via the Internet Protocol network at 
an ingress port of the switch; 

determining the content requested by the client request 
based on a plurality of characteristics related to por- 
tions of the client request; 

selecting a server from among a set of candidate servers 
based on the determining; 

subsequently forwarding to the selected server packets 
originating from the client which are associated with 
the client request for content via a switch egress port; 
and 

subsequently forwarding to the client transmissions origi- 
nating from the selected server which are associated 
with the client request for content via a switch egress 
port. 

54. The method of claim 53, wherein the determining 
comprises determining based on a URL (Universal Resource 
Locator) included in the request. 

55. The method of claim 53, wherein the determining 
comprises determining based on the requested domain name 
included in the request. 

56. The method of claim 53, wherein the determining 
comprises determining a type of content requested.